Distributed DuckDB-native DataFrames for Elixir.
Pipelines compile to SQL. The BEAM handles the rest.

Lazy pipelines compile to SQL CTEs. DuckDB handles the compute. The BEAM handles distribution. Graph analytics, cross-source joins, Nx interop — one library, one engine, fully exposed.

elixir
# mix.exs
{:dux, "~> 0.3.0"}

Verbs that pipe

Every operation is a function on %Dux{}. Filter, mutate, group, summarise, join — they all return a new struct. Nothing hits DuckDB until you call compute/1.

elixir
require Dux

Dux.from_parquet("s3://data/sales/**/*.parquet")
|> Dux.filter(amount > 100 and region == ^selected_region)
|> Dux.mutate(revenue: price * quantity)
|> Dux.group_by(:product)
|> Dux.summarise(total: sum(revenue), orders: count(product))
|> Dux.sort_by(desc: :total)
|> Dux.to_parquet("results.parquet", compression: :zstd)

Bare identifiers are columns. amount, region, price — the Query macro captures the expression as AST.
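Elixir's own `quote` shows the kind of data a macro layer works with (a generic illustration of AST capture, not Dux internals):

```elixir
# `quote` turns an expression into plain data: nested {operator, metadata,
# arguments} tuples that a compiler can walk and render as SQL.
ast = quote do: amount > 100
# => {:>, [context: Elixir, ...], [{:amount, [], Elixir}, 100]}

elem(ast, 0)
# => :>
```

Because the captured expression is just tuples, translating it is ordinary tree traversal, no runtime evaluation involved.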

^ is interpolation. Elixir values become parameter bindings ($1, $2), making queries safe from SQL injection by construction.

DuckDB reads S3 directly. httpfs extension, no ExAws, no Req. Parquet glob expansion is native.

Everything is lazy until to_parquet/2 triggers compute/1. The full pipeline compiles to a single SQL query with CTEs.

Elixir expressions compile to SQL

Each verb becomes a CTE. The pipeline is a chain of CTEs ending in a single SELECT. DuckDB handles all pushdown optimisation.

elixir
Dux.from_parquet("sales.parquet")
|> Dux.filter(year > 2024)
|> Dux.mutate(revenue: price * quantity)
|> Dux.group_by(:region)
|> Dux.summarise(total: sum(revenue))
|> Dux.compute()
compiles to
sql
WITH __s0 AS (
  SELECT * FROM read_parquet('sales.parquet')
  WHERE year > 2024
),
__s1 AS (
  SELECT *, (price * quantity) AS revenue
  FROM __s0
),
__s2 AS (
  SELECT region, SUM(revenue) AS total
  FROM __s1 GROUP BY region
)
SELECT * FROM __s2
No LazySeries intermediate. Query macro → AST traversal → SQL. One step.

DuckDB per node. The BEAM for everything else.

Spark proved that DataFrames can be distributed. But Spark carries the weight of the JVM — slow startup, heavy serialisation, complex cluster management. The BEAM already has all of this, battle-tested for 30+ years.

elixir
# Workers auto-register via :pg
workers = Dux.Remote.Worker.list()

Dux.from_parquet("s3://lake/events/**/*.parquet")
|> Dux.distribute(workers)
|> Dux.filter(amount > 100)
|> Dux.group_by(:region)
|> Dux.summarise(total: sum(amount))
|> Dux.compute()

No function serialisation. %Dux{} is plain data — an AST of operations. Ship it to any BEAM node, compile to SQL there. The code is already deployed.

No cluster manager. libcluster + :pg. Nodes discover each other, join process groups, start working.
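A minimal topology sketch, assuming the standard libcluster setup (the `dux_cluster` name and the Gossip strategy are illustrative choices, not Dux requirements):

```elixir
# config/runtime.exs — nodes on the same network segment find each other
# via UDP gossip; :pg group membership follows automatically once
# nodes are connected.
import Config

config :libcluster,
  topologies: [
    dux_cluster: [
      strategy: Cluster.Strategy.Gossip
    ]
  ]
```

Swap the strategy (Kubernetes, DNS, EPMD) per environment; the pipeline code doesn't change.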

Arrow IPC over BEAM distribution. Same bytes in memory and on the wire, with 6 bytes of framing overhead. Backpressure, node monitoring, fault tolerance — all for free.

Lattice-based streaming merge. SUM, COUNT, MIN, MAX — partial results fold as workers complete. Progressive results before the last worker finishes.
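The merge works because these aggregates are commutative and associative: partials fold in any arrival order. A plain-Elixir sketch of the idea (illustrative, not Dux internals; the identity for `max` assumes non-negative values):

```elixir
# Each worker returns a partial aggregate; the coordinator folds them
# as they arrive. SUM and COUNT merge by addition, MAX by comparison.
merge = fn acc, partial ->
  %{
    total: acc.total + partial.total,
    orders: acc.orders + partial.orders,
    max: max(acc.max, partial.max)
  }
end

partials = [
  %{total: 100, orders: 3, max: 60},
  %{total: 250, orders: 5, max: 90}
]

Enum.reduce(partials, %{total: 0, orders: 0, max: 0}, merge)
# => %{total: 350, orders: 8, max: 90}
```

Folding after each arrival is what makes progressive results possible: the accumulator is always a valid (partial) answer.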

A graph is two dataframes

Vertices + edges. Every graph algorithm reduces to joins and aggregations. PageRank is iterative join → group_by → sum → normalise. Shortest path uses DuckDB's USING KEY recursive CTEs (SIGMOD 2025).

elixir
graph = Dux.Graph.new(vertices: users, edges: follows)

# Who is most influential?
graph
|> Dux.Graph.pagerank(iterations: 20)
|> Dux.sort_by(desc: :rank)
|> Dux.head(10)

# Shortest path via USING KEY recursive CTE
graph |> Dux.Graph.shortest_paths(start_node)

# Connected components, triangle counting
graph |> Dux.Graph.connected_components()
graph |> Dux.Graph.triangle_count()

# Graph results are %Dux{} — pipe into any verb
graph
|> Dux.Graph.pagerank()
|> Dux.join(departments, on: :id)
|> Dux.group_by(:dept)
|> Dux.summarise(avg_rank: avg(rank))
|> Dux.compute()
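For intuition, a classic recursive-CTE shortest path looks like the sketch below (hand-written illustration, not Dux's generated SQL; the `edges(src, dst, weight)` table, start node, and cost bound are assumptions). DuckDB's USING KEY variant replaces the naive cost bound by keeping only the best row per key across iterations.

sql
WITH RECURSIVE paths(dst, cost) AS (
  SELECT dst, weight FROM edges WHERE src = 1
  UNION ALL
  SELECT e.dst, p.cost + e.weight
  FROM paths p JOIN edges e ON e.src = p.dst
  WHERE p.cost + e.weight < 100  -- naive termination guard
)
SELECT dst, MIN(cost) AS shortest
FROM paths
GROUP BY dst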

Postgres + S3 + Parquet in one pipeline

Attach databases, reference tables, join across sources. DuckDB pushes filters down to remote databases. No ETL, no staging.

elixir
Dux.attach(:warehouse, "postgresql://prod/analytics", type: :postgres)

orders    = Dux.from_parquet("s3://exports/orders/*.parquet")
customers = Dux.from_attached(:warehouse, "public.customers")

orders
|> Dux.join(customers, on: :customer_id)
|> Dux.filter(order_date > ~D[2026-01-01])
|> Dux.group_by(:region)
|> Dux.summarise(revenue: sum(amount))
|> Dux.compute()
Postgres, MySQL, SQLite, Iceberg, Delta Lake, DuckLake — all via DuckDB ATTACH.

Nx tensors, zero-copy

Query results arrive as Arrow columnar data via ADBC. After compute/1, numeric columns become Nx tensors via pointer passing, with no copy. Dux implements Nx.LazyContainer.

elixir
tensor = Dux.to_tensor(df, :price)
# #Nx.Tensor<f64[1000] [...]>
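Once a column is a tensor, the standard Nx API applies. A small illustration, where the literal tensor stands in for a Dux result column:

```elixir
# Any Nx operation works on the extracted column.
t = Nx.tensor([10.0, 20.0, 30.0])

t |> Nx.mean() |> Nx.to_number()
# => 20.0
```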

When the macro doesn't cover it

The _with variants accept raw DuckDB SQL strings. Window functions, CASE WHEN, PIVOT, arbitrary CTEs — anything DuckDB can do.

elixir
# Window functions
Dux.mutate_with(df, rank: "ROW_NUMBER() OVER (PARTITION BY dept ORDER BY salary DESC)")

# Any DuckDB SQL
Dux.from_query("PIVOT sales ON product USING SUM(amount) GROUP BY region")

What's in the box

api (Dux): The module is the dataframe. All verbs live here. %Dux{} is a source + ops list — plain Elixir data, freely serialisable.
compiler (Query → SQL): Macro captures expressions as AST. Direct traversal to DuckDB SQL. ^ interpolations become $1, $2 bindings.
engine (DuckDB via ADBC): Pure Elixir ADBC wrapper. Zero Rust. Precompiled driver downloaded at compile time. Arrow columnar storage.
distribution (Coordinator): Partition, fan-out, merge. PipelineSplitter decomposes ops into worker-safe + coordinator-only. Arrow IPC wire format.
runtime (BEAM): :pg for discovery. :erpc for fan-out. Supervisor trees for fault tolerance. No JVM, no cluster manager.
graph (Dux.Graph): PageRank, shortest paths, connected components, triangle counting. Algorithms are verb compositions. Distributable.

Rich rendering out of the box

Add kino_dux to your Livebook setup cell. Lazy pipelines render with source provenance, accumulated operations, and generated SQL. Computed results become interactive data tables.

elixir
Mix.install([
  {:dux, "~> 0.3.0"},
  {:kino_dux, "~> 0.1"}
])
Smart cells for SQL preview, data source browsing, chart building, and FLAME cluster configuration.

Read and write anything DuckDB can

elixir
# Read
Dux.from_csv("data.csv", delimiter: "\t")
Dux.from_parquet("data/**/*.parquet")
Dux.from_ndjson("events.ndjson")
Dux.from_excel("sales.xlsx", sheet: "Q1")
Dux.from_query("SELECT * FROM read_parquet('s3://bucket/data.parquet')")

# Write
Dux.to_csv(df, "output.csv")
Dux.to_parquet(df, "output.parquet", partition_by: [:year, :month])
Dux.to_excel(df, "report.xlsx")
Dux.to_ndjson(df, "output.ndjson")
Dux.insert_into(df, "pg.public.events", create: true)
S3, HTTP, Postgres, MySQL, SQLite, Excel, DuckLake — all via DuckDB extensions. No separate libraries.