Data Transformation

A tidyverse-inspired API for querying data in the browser with DuckDB-WASM and Svelte 5.

Introduction

If you've used R's tidyverse, you know the joy of piping data through a chain of verbs: filter(), mutate(), select(), arrange(), count(), summarize(). tidyduck brings that same fluency to the browser. Instead of tibbles and dplyr, you get parquet files and a reactive query builder powered by DuckDB-WASM and Svelte 5.

The goal of this chapter is to give you an overview of the key tools for transforming data in the browser. We'll start with functions that operate on rows, then on columns, and finally on groups. Along the way, we'll compare the tidyverse equivalent so you can map your existing knowledge to this new API.

Prerequisites

Install the add-on into any SvelteKit project:

npx sv add @the-vcsi/tidyduck

This drops three files into src/lib/db/:

FilePurpose
duckdb.svelte.tsDuckDB-WASM singleton — lazy-loads the engine on first query
sql.svelte.tsReactive primitives: duck(), duck_val(), duck_col()
duck.svelte.tsQuery builder + composable SQL helpers

Place your data files in static/data/. For this guide, we use flights.parquet. Then import database() and register your tables — similar to Observable's DuckDBClient.of():

svelte
import { database } from '$lib/db/duck.svelte';

const db = database({
  flights: 'flights.parquet'
});

const flights = db.from('flights');

That's the tidyduck equivalent of library(tidyverse) and flights. The database() function creates a named registry that maps table names to parquet file paths in static/data/. Calling db.from() returns a query builder — nothing runs until you materialize it with a verb like .rows().

nycflights13

We use the nycflights13 dataset throughout — all 0 flights that departed from New York City in 2013. The same dataset used in R for Data Science. Except here, the data lives in a .parquet file and is queried entirely in your browser.

In R, typing flights prints the first few rows of the tibble. The tidyduck equivalent is .head():

svelte
const preview = flights.head();  // first 6 rows (default)
live 1.1

Loading…

For a transposed view showing every column with its type and sample values — like R's glimpse() — use .glimpse():

svelte
const info = flights.glimpse();
// info.columns → [{ name, type, sample }, ...]
// info.nRows, info.nCols
live 1.2

Loading…

Finally, for a statistical overview of every column — like R's summary() or Observable's table summary — use .describe(). It runs DuckDB's SUMMARIZE under the hood, giving you min, max, quantiles, and null percentages in one call:

svelte
const summary = flights.describe();
// summary.rows → [{ column_name, column_type, min, max, q25, q50, q75, avg, ... }, ...]
live 1.3

Loading…

dplyr basics

In dplyr, every verb takes a data frame as its first argument and returns a new data frame. The pipe |> threads them together. In tidyduck, db.from() creates an immutable builder. Each method returns a new builder — the original is never modified:

svelte
// dplyr:  flights |> filter(...) |> arrange(...)
// tidyduck:
const q = flights
  .between('year', () => selectedYear)
  .in('carrier', () => selectedCarriers)
  .eq('origin', () => selectedOrigin);

Filter methods — .where(), .between(), .in(), .ilike(), .eq() — add reactive clauses. Materialization methods — .rows(), .count(), .distinct(), .summarize() — actually run the query and return reactive results.

Because each filter method takes a function (not a raw value), the query automatically re-runs when your Svelte state changes. No manual subscriptions, no useEffect — just Svelte 5 runes doing their thing.

dplyr → tidyduck

Conceptdplyrtidyduck
Data sourceflights (tibble)db.from('flights')
Chaining|> pipeMethod chaining (.where().in())
ImmutabilityEach verb returns a new tibbleEach method returns a new builder
Lazy evaluationEager (runs immediately)Lazy until .rows(), .count(), etc.

Rows

The most important verbs that operate on rows are filter() (keep rows matching a condition), arrange() (reorder rows), and distinct() (find unique rows). Here's how each maps to tidyduck.

filter() → .where(), .eq(), .in(), .between(), .ilike()

dplyr's filter() keeps rows based on conditions. In tidyduck, the builder offers several specialized filter methods that handle common patterns and automatically skip inactive filters:

r
# dplyr: flights that departed more than 120 minutes late
flights |> 
  filter(dep_delay > 120)
svelte
// tidyduck
const delayed = flights.where(() => "dep_delay > 120").rows();
live 2.1
120 min

0 flights with dep_delay > 120

Loading…

The .where() method accepts any SQL expression. For common patterns, use the specialized helpers:

r
# dplyr: flights in January or February
flights |> filter(month %in% c(1, 2))
svelte
// tidyduck: .in() works for string columns
// For numeric columns like month, use .where() with raw SQL:
let selectedMonths = $state([1, 2]);

const janFeb = flights
  .where(() => selectedMonths.length === 0
    ? null
    : `month IN (${selectedMonths.join(', ')})`
  )
  .rows();
live 2.2

0 flights in month(s) 1, 2

Here's the full set of filter helpers and when to reach for each:

MethodSQL generatedSkips when
.eq(col, () => val)col = 'val'value is null
.in(col, () => arr)col IN ('a','b')array is empty
.between(col, () => [lo,hi])col BETWEEN lo AND himatches full range
.ilike(col, () => str)col ILIKE '%str%'string is empty
.where(() => expr)any SQL expressionreturns null

Automatic skip behavior. Every filter method gracefully handles "no selection" states. An empty array in .in(), a blank string in .ilike(), or null in .eq() means the filter is simply ignored — no if statements needed. This is one of tidyduck's most powerful features: your query adapts to the UI state automatically.

Combining conditions with or()

Chained filters combine with AND, just like comma-separated conditions in dplyr's filter(). For OR logic, use the or() helper inside .where():

svelte
// Search across multiple columns at once
let search = $state('');

const results = flights
  .where(() => or(
    ilike('carrier', search),
    ilike('dest', search),
    ilike('origin', search)
  ))
  .count();
live 2.3

0 flights (no filter — type to search)

arrange()

dplyr's arrange() reorders rows. In tidyduck, .arrange() sets the ORDER BY clause. Use plain column names for ascending order, or wrap with desc() for descending — just like dplyr. NULLs always sort last, matching R's behavior:

r
# dplyr: most delayed flights first
flights |> arrange(desc(dep_delay))
svelte
import { database, desc } from '$lib/db/duck.svelte';

// tidyduck: same pattern
const mostDelayed = flights.arrange(desc('dep_delay')).rows();

// multiple columns
const sorted = flights.arrange('carrier', desc('dep_delay')).rows();

.arrange() returns a new builder — the original is unmodified. It applies to .rows() and .head(). For grouped queries, use .sql() with ORDER BY directly.

distinct()

dplyr's distinct() finds unique rows. In tidyduck, .distinct() is overloaded just like in dplyr:

r
# dplyr: unique carriers
flights |> distinct(carrier)
# unique origin-dest combos
flights |> distinct(origin, dest)
svelte
// Single column → { items } (flat array, perfect for dropdowns)
const carriers = flights.distinct('carrier');
// carriers.items → ['AA', 'B6', 'DL', 'EV', 'UA', ...]

// Multiple columns → { rows } (unique combinations)
const routes = flights.distinct('origin', 'dest');
// routes.rows → [{ origin: 'EWR', dest: 'ALB' }, ...]

// No args → all unique rows
const unique = flights.distinct();
live 2.4

Loading…

Note. Single-column .distinct('col') returns sorted, non-null values as a flat array — ideal for populating filter chips and dropdowns. Multi-column and no-arg forms return rows, like .rows().

count()

In dplyr, count() is overloaded: with no arguments it gives the total number of rows, and with column names it gives grouped counts sorted in descending order. tidyduck works the same way:

r
# dplyr: count by origin and destination, sorted
flights |> count(origin, dest, sort = TRUE)
svelte
const byRoute = flights.count('origin', 'dest');
// byRoute.rows → [{ origin: 'JFK', dest: 'LAX', n: 11262 }, ...]
live 2.5

Loading…

Columns

The most important verbs that affect the columns without changing the rows are mutate() (creates new columns from existing ones), select() (picks which columns to keep), and rename() (changes column names). Here's how each maps to tidyduck.

mutate()

dplyr's mutate() adds new columns that are calculated from existing columns. In tidyduck, .mutate() takes an object of {alias: "SQL expression"} pairs and appends them to SELECT *:

r
# dplyr
flights |> mutate(
  gain = dep_delay - arr_delay,
  speed = distance / air_time * 60
)
svelte
// tidyduck
const withComputed = flights.mutate({
  gain: 'dep_delay - arr_delay',
  speed: 'ROUND(distance / air_time * 60, 1)'
});
// .rows() → SELECT *, dep_delay - arr_delay AS gain, ... FROM ...
live 3.1

Loading…

Like all builder methods, .mutate() returns a new builder — the original is never modified. Multiple .mutate() calls accumulate:

svelte
// Chain multiple mutate calls
const computed = flights
  .mutate({ gain: 'dep_delay - arr_delay' })
  .mutate({ hours: 'air_time / 60' });

Keep or select. In dplyr, .keep = "used" retains only the columns used in the expression. In tidyduck, combine .mutate() with .select() to pick which columns to keep alongside your new ones.

select()

dplyr's select() lets you zoom in on specific columns. In tidyduck, .select() replaces SELECT * with your chosen columns:

r
# dplyr: select columns by name
flights |> select(year, month, day)
svelte
// tidyduck
const subset = flights.select('year', 'month', 'day');
// .rows() → SELECT year, month, day FROM ...
live 3.2

Loading…

You can also rename columns as you select them, using SQL's AS syntax — just like dplyr's select(new_name = old_name):

svelte
// Rename while selecting
const renamed = flights.select('tailnum AS tail_num', 'carrier', 'dest');

For dplyr-style helpers like starts_with() or contains(), use DuckDB's COLUMNS expression in .sql():

svelte
// DuckDB COLUMNS regex — like select(contains("delay"))
const delays = flights.sql((where) =>
  `SELECT COLUMNS('.*delay.*') FROM 'flights.parquet' ${where}`
);

rename()

dplyr's rename() changes column names while keeping all columns. In tidyduck, .rename() uses DuckDB's EXCLUDE syntax under the hood:

r
# dplyr
flights |> rename(tail_num = tailnum)
svelte
// tidyduck
const renamed = flights.rename({ tail_num: 'tailnum' });
// .rows() → SELECT * EXCLUDE (tailnum), tailnum AS tail_num FROM ...
live 3.3

Loading…

Note. dplyr also has relocate() to move columns around. In browser-side data analysis, column order rarely matters since you control how data is displayed in your Svelte components. If you need a specific order, use .select() to list columns in the desired sequence.

The Pipe

In R, the pipe |> lets you chain verbs into readable pipelines. In tidyduck, method chaining is the pipe. Because each method returns a new builder, you can thread operations together naturally:

r
# dplyr: find fastest flights to Houston
flights |>
  filter(dest == "IAH") |>
  mutate(speed = distance / air_time * 60) |>
  select(year:day, dep_time, carrier, flight, speed) |>
  arrange(desc(speed))
svelte
// tidyduck: same pipeline
import { database, desc } from '$lib/db/duck.svelte';

const fastest = flights
  .where(() => "dest = 'IAH'")
  .mutate({ speed: 'ROUND(distance / air_time * 60, 1)' })
  .select('year', 'month', 'day', 'dep_time', 'carrier', 'flight', 'speed')
  .arrange(desc('speed'))
  .head(10);

Even though this pipeline has five steps, it's easy to read because each method starts a new line. Without chaining, you'd need to write raw SQL:

svelte
// Without the builder — same query, harder to compose
const fastest = duck(() => `
  SELECT year, month, day, dep_time, carrier, flight,
    ROUND(distance / air_time * 60, 1) AS speed
  FROM 'flights.parquet'
  WHERE dest = 'IAH'
  ORDER BY speed DESC
  LIMIT 10
`);

The builder approach has two advantages: readability (each verb is a clear step) and composability (you can fork a builder to create multiple views of the same data):

svelte
// Fork from a shared base
const houston = flights.where(() => "dest = 'IAH'");

const topCarriers = houston.count('carrier');  // who flies there most?
const delays      = houston.summarize({ avg: 'ROUND(AVG(arr_delay), 1)' });
const fastest     = houston
  .mutate({ speed: 'ROUND(distance / air_time * 60, 1)' })
  .arrange(desc('speed'))
  .head(5);

Groups

So far you've learned about functions that work with rows and columns. The builder gets even more powerful when you add in the ability to work with groups. In this section, we'll focus on group_by(), summarize(), and the slice_ family.

group_by()

In dplyr, group_by() is a separate step that marks a data frame for subsequent grouped operations. In tidyduck, there's no separate group_by() — instead, grouping is always per-operation, just like dplyr's newer .by argument. You pass grouping columns directly to .summarize(), .count(), or .sliceMax():

r
# dplyr: two equivalent ways
flights |> group_by(month) |> summarize(avg = mean(dep_delay, na.rm = TRUE))
flights |> summarize(avg = mean(dep_delay, na.rm = TRUE), .by = month)
svelte
// tidyduck: grouping is always inline (like .by)
const monthly = flights.summarize({
  avg_delay: 'ROUND(AVG(dep_delay), 1)'
}, 'month');

This means you never need to ungroup() — each operation is self-contained. This is simpler and avoids the common dplyr pitfall of forgetting to ungroup.

summarize()

The .summarize() method takes a dictionary of {alias: "SQL_EXPRESSION"} pairs. Without a by argument, it aggregates the whole table:

r
# dplyr: overall summary
flights |> summarize(
  avg_delay = mean(arr_delay, na.rm = TRUE),
  max_delay = max(arr_delay, na.rm = TRUE),
  n = n()
)
svelte
const stats = flights.summarize({
  avg_delay: 'ROUND(AVG(arr_delay), 2)',
  max_delay: 'MAX(arr_delay)',
  total: 'COUNT(*)'
});
live 5.1

Loading…

Add a by argument to group — the result has one row per group, sorted by the grouping columns:

r
# dplyr: grouped summary
flights |>
  group_by(month) |>
  summarize(
    avg_delay = mean(dep_delay, na.rm = TRUE),
    n = n()
  )
svelte
// tidyduck: pass grouping column(s) as second argument
const monthly = flights.summarize({
  avg_delay: 'ROUND(AVG(dep_delay), 1)',
  n: 'COUNT(*)'
}, 'month');
live 5.2

Loading…

A note on missing values. In R, mean(x) returns NA if any value is missing — you must explicitly opt in with na.rm = TRUE. SQL takes the opposite default: aggregate functions like AVG, SUM, MIN, and MAX silently skip NULLs. This is convenient, but means missing data won't loudly announce itself the way R encourages. If your analysis depends on understanding missingness, use .describe() to check null_percentage per column before aggregating.

For multiple grouping variables, pass an array:

svelte
// Group by origin and destination
const byRoute = flights.summarize({
  avg_delay: 'ROUND(AVG(arr_delay), 1)',
  n: 'COUNT(*)'
}, ['origin', 'dest']);

In dplyr, you can keep piping after summarize() — e.g., filter(n > 100) to drop small groups. In tidyduck, .summarize() is a terminal verb that materializes results, so you can't chain further. For post-aggregation filtering (HAVING) or custom ordering, use .sql() — it gives you full SQL and still respects your builder's WHERE clause:

r
# dplyr: filter after grouping
flights |>
  group_by(dest) |>
  summarize(avg_delay = mean(arr_delay, na.rm = TRUE), n = n()) |>
  filter(n > 100)
svelte
// tidyduck: use .sql() for HAVING and custom ordering
const byDest = flights.sql((where) => `
  SELECT dest,
    ROUND(AVG(arr_delay), 1) AS avg_delay,
    COUNT(*) AS n
  FROM 'flights.parquet'
  ${where}
  GROUP BY dest
  HAVING COUNT(*) > 100
  ORDER BY avg_delay DESC
`);
live 5.3

Loading…

sliceMax() and sliceMin()

dplyr has a family of slice_ functions for extracting specific rows within groups. The most useful are slice_max() and slice_min(), which find the rows with the largest or smallest values. tidyduck provides .sliceMax() and .sliceMin():

r
# dplyr: most delayed flight per destination
flights |>
  group_by(dest) |>
  slice_max(arr_delay, n = 1)
svelte
// tidyduck: same, with per-operation grouping
const mostDelayed = flights.sliceMax('arr_delay', 1, 'dest');

// Without grouping — just the overall top N
const top5 = flights.sliceMax('arr_delay', 5);

// Bottom N works the same way
const shortest = flights.sliceMin('air_time', 5);
live 5.4

Loading…

By default, tidyduck breaks ties arbitrarily (one row per rank). Like dplyr's with_ties argument, pass { withTies: true } to keep all tied rows:

svelte
// Keep all ties (uses RANK instead of ROW_NUMBER)
const mostDelayed = flights.sliceMax('arr_delay', 1, 'dest', { withTies: true });

Grouped count()

.count() with column names is the quickest way to see group sizes, sorted in descending order — just like dplyr:

r
# dplyr
flights |> count(carrier, sort = TRUE)
svelte
const byCarrier = flights.count('carrier');
// byCarrier.rows → [{ carrier: 'UA', n: 58665 }, { carrier: 'B6', n: 54635 }, ...]
live 5.5

Loading…

Multiple grouping variables work too — you can count by as many columns as you need:

svelte
// Daily flight counts
const daily = flights.count('year', 'month', 'day');

Joins

It's rare that an analysis involves only a single table. The nycflights13 dataset actually has five related tables — flights, airlines, airports, planes, and weather — connected through shared keys like carrier, tailnum, and faa.

This is where database() shines. Register all your tables upfront, then use db.sql() for joins:

svelte
import { database } from '$lib/db/duck.svelte';

const db = database({
  flights: 'flights.parquet',
  airlines: 'airlines.parquet',
  airports: 'airports.parquet',
  planes: 'planes.parquet',
  weather: 'weather.parquet'
});

Mutating joins

A mutating join adds columns from one table based on matching keys — like dplyr's left_join(). In tidyduck, you write SQL joins via db.sql().

For example, adding the full airline name to flights:

r
# dplyr
flights2 |> left_join(airlines)
svelte
// tidyduck
const withAirline = db.sql(t =>
  `SELECT f.*, a.name as airline_name
  FROM ${t.flights} f
  LEFT JOIN ${t.airlines} a ON f.carrier = a.carrier`
);
live 6.1

Loading…

Or finding out what size of plane was flying:

r
# dplyr
flights2 |>
  left_join(planes |> select(tailnum, type, engines, seats))
svelte
// tidyduck
const withPlane = db.sql(t =>
  `SELECT f.year, f.month, f.day, f.carrier, f.flight,
          p.type, p.engines, p.seats
  FROM ${t.flights} f
  LEFT JOIN ${t.planes} p ON f.tailnum = p.tailnum`
);

When keys have different names across tables, you specify them explicitly in the ON clause. For example, joining flights to airports by destination:

r
# dplyr — keys have different names
flights2 |> left_join(airports, join_by(dest == faa))
svelte
// tidyduck — ON spells out the mapping
const withAirport = db.sql(t =>
  `SELECT f.*, a.name, a.lat, a.lon
  FROM ${t.flights} f
  LEFT JOIN ${t.airports} a ON f.dest = a.faa`
);
live 6.2

Loading…

When LEFT JOIN fails to find a match, it fills the new columns with NULL — just like dplyr fills with NA. For example, filtering to a specific tail number that's missing from the planes table:

r
# dplyr
flights2 |>
  filter(tailnum == "N3ALAA") |>
  left_join(planes |> select(tailnum, type, engines, seats))
svelte
// tidyduck — combine builder filter with db.sql() join
const mystery = db.sql(t =>
  `SELECT f.year, f.time_hour, f.origin, f.dest, f.tailnum, f.carrier,
          p.type, p.engines, p.seats
  FROM ${t.flights} f
  LEFT JOIN ${t.planes} p ON f.tailnum = p.tailnum
  WHERE f.tailnum = 'N3ALAA'
  LIMIT 6`
);

The type, engines, and seats columns are all NULL — plane N3ALAA isn't in the planes table.

Specifying join keys

In dplyr, left_join() automatically matches on shared column names (a "natural" join). In SQL you're always explicit: ON f.carrier = a.carrier. This avoids the pitfall where flights and planes both have a year column meaning different things.

Filtering joins

Filtering joins keep or drop rows based on whether they match another table — without adding columns.

A semi-join keeps rows in x that have a match in y. For example, finding airports that are actual destinations:

r
# dplyr
airports |> semi_join(flights2, join_by(faa == dest))
svelte
// tidyduck — semi-join via WHERE EXISTS
      const destAirports = db.sql(t =>
        `SELECT a.*
        FROM ${t.airports} a
        WHERE EXISTS (
          SELECT 1 FROM ${t.flights} f WHERE f.dest = a.faa
        )`
      );

An anti-join is the opposite: rows in x that don't match y. Useful for finding missing data:

r
# dplyr — destinations not in airports table
flights2 |>
  anti_join(airports, join_by(dest == faa)) |>
  distinct(dest)
svelte
// tidyduck — anti-join via WHERE NOT EXISTS
const missingAirports = db.sql<{ dest: string }>(t =>
  `SELECT DISTINCT f.dest
  FROM ${t.flights} f
  WHERE NOT EXISTS (
    SELECT 1 FROM ${t.airports} a WHERE a.faa = f.dest
  )`
);

dplyr joins → tidyduck SQL

dplyrtidyduck (db.sql())
left_join(y)LEFT JOIN y ON ...
inner_join(y)INNER JOIN y ON ...
right_join(y)RIGHT JOIN y ON ...
full_join(y)FULL OUTER JOIN y ON ...
semi_join(y)WHERE EXISTS (SELECT 1 FROM y ...)
anti_join(y)WHERE NOT EXISTS (SELECT 1 FROM y ...)
cross_join(y)CROSS JOIN y

Joins are where the builder hands off to raw SQL — and that's by design. SQL's JOIN syntax is already concise and universally understood. The database() registry handles the tedious part (file paths), and db.sql() gives you the full power of DuckDB's SQL engine, including inequality joins, ASOF joins, and CTEs.

Putting It All Together

The real power of tidyduck is that every query is reactive. Bind your Svelte state to the builder, and the UI updates automatically when filters change. Try it — filter by carrier and search for an airport:

interactive demo

0 flights match (no filters active — try selecting some carriers)

svelte
  <script lang="ts">
  import { database, or, ilike } from '$lib/db/duck.svelte';

  const db = database({ flights: 'flights.parquet' });
  const flights = db.from('flights');

  let search = $state('');
  let selectedCarriers = $state<string[]>([]);

  const q = flights
    .in('carrier', () => selectedCarriers)
    .where(() => or(
      ilike('origin', search),
      ilike('dest', search)
    ));

  const total  = q.count();
  const byDest = q.count('dest');
</script>

<p>{total.value} flights match your filters</p>

{#each byDest.rows.slice(0, 10) as route}
  <p>{route.dest}: {route.n}</p>
{/each}

No useEffect. No manual query invalidation. Change search or selectedCarriers, and every materialized query re-runs automatically.

Quick reference: dplyr → tidyduck

dplyrtidyduck
DBI::dbConnect(...)database({ name: "file.parquet" })
filter(col == val).eq('col', () => val)
filter(col %in% c(...)).in('col', () => [...])
filter(col >= lo & col <= hi).between('col', () => [lo, hi])
filter(str_detect(col, pat)).ilike('col', () => pat)
filter(a | b).where(() => or(a, b))
arrange(desc(col)).arrange(desc('col'))
distinct(col).distinct('col')
mutate(speed = dist/time).mutate({ speed: "dist/time" })
select(year, month).select('year', 'month')
rename(new = old).rename({ new: "old" })
count(col, sort=T).count('col')
summarize(avg = mean(x)).summarize({ avg: "AVG(x)" })
group_by(col) |> summarize().summarize({...}, 'col')
slice_max(col, n=1, by=grp).sliceMax('col', 1, 'grp')
slice_max(with_ties=TRUE).sliceMax('col', 1, 'grp', { withTies: true })
slice_min(col, n=1, by=grp).sliceMin('col', 1, 'grp')
left_join(y, join_by(k))db.sql(t => `... LEFT JOIN ...`)
semi_join(y)db.sql(t => `... WHERE EXISTS ...`)
anti_join(y)db.sql(t => `... WHERE NOT EXISTS ...`)