Lazy pipelines compile to SQL CTEs. DuckDB handles the compute. The BEAM handles distribution. Graph analytics, cross-source joins, Nx interop — one library, one engine, fully exposed.
Every operation is a function on %Dux{}. Filter, mutate, group, summarise, join — they all return a new struct. Nothing hits DuckDB until you call compute/1.
```elixir
require Dux

Dux.from_parquet("s3://data/sales/**/*.parquet")
|> Dux.filter(amount > 100 and region == ^selected_region)
|> Dux.mutate(revenue: price * quantity)
|> Dux.group_by(:product)
|> Dux.summarise(total: sum(revenue), orders: count(product))
|> Dux.sort_by(desc: :total)
|> Dux.to_parquet("results.parquet", compression: :zstd)
```
Bare identifiers are columns. amount, region, price — the Query macro captures the expression as AST. ^ is interpolation: Elixir values become parameter bindings ($1, $2). SQL-injection safe by construction.
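To make the binding idea concrete, here is a minimal sketch — not Dux's macro internals; BindSketch and its equality-only grammar are invented for illustration. Pinned values are collected into a parameter list while the emitted SQL only ever contains numbered placeholders:

```elixir
defmodule BindSketch do
  # Turn a keyword list of column/value pairs into a WHERE fragment
  # plus an ordered parameter list. Values never enter the SQL text.
  def where(conditions) do
    {fragments, params} =
      conditions
      |> Enum.with_index(1)
      |> Enum.map(fn {{column, value}, i} -> {"#{column} = $#{i}", value} end)
      |> Enum.unzip()

    {Enum.join(fragments, " AND "), params}
  end
end

BindSketch.where(amount: 100, region: "EMEA")
# => {"amount = $1 AND region = $2", [100, "EMEA"]}
```

The real macro walks an arbitrary expression AST, but the invariant is the same: user data travels only through the bindings list.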
DuckDB reads S3 directly. httpfs extension, no ExAws, no Req. Parquet glob expansion is native.

Everything is lazy until to_parquet/2 triggers compute/1. The full pipeline compiles to a single SQL query with CTEs.
Each verb becomes a CTE. The pipeline is a chain. DuckDB handles all pushdown optimisation.
```elixir
Dux.from_parquet("sales.parquet")
|> Dux.filter(year > 2024)
|> Dux.mutate(revenue: price * quantity)
|> Dux.group_by(:region)
|> Dux.summarise(total: sum(revenue))
|> Dux.compute()
```
```sql
WITH __s0 AS (
  SELECT * FROM read_parquet('sales.parquet') WHERE year > 2024
),
__s1 AS (
  SELECT *, (price * quantity) AS revenue FROM __s0
),
__s2 AS (
  SELECT region, SUM(revenue) AS total FROM __s1 GROUP BY region
)
SELECT * FROM __s2
```
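The chaining itself is a simple fold. A sketch of the idea (not Dux's actual compiler): each step's SELECT reads from the previous step's alias, and the final query reads the last one:

```elixir
# One SQL fragment per verb; each selects from the previous CTE alias.
steps = [
  "SELECT * FROM read_parquet('sales.parquet') WHERE year > 2024",
  "SELECT *, (price * quantity) AS revenue FROM __s0",
  "SELECT region, SUM(revenue) AS total FROM __s1 GROUP BY region"
]

ctes =
  steps
  |> Enum.with_index()
  |> Enum.map_join(", ", fn {sql, i} -> "__s#{i} AS (#{sql})" end)

query = "WITH #{ctes} SELECT * FROM __s#{length(steps) - 1}"
```

DuckDB's optimiser then flattens and pushes predicates through the CTE chain, so the naive chaining costs nothing at runtime.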
Spark proved that DataFrames can be distributed. But Spark carries the weight of the JVM — slow startup, heavy serialisation, complex cluster management. The BEAM already has all of this, battle-tested for 30+ years.
```elixir
# Workers auto-register via :pg
workers = Dux.Remote.Worker.list()

Dux.from_parquet("s3://lake/events/**/*.parquet")
|> Dux.distribute(workers)
|> Dux.filter(amount > 100)
|> Dux.group_by(:region)
|> Dux.summarise(total: sum(amount))
|> Dux.compute()
```
No function serialisation. %Dux{} is plain data — an AST of operations. Ship it to any BEAM node, compile to SQL there. The code is already deployed.

No cluster manager. libcluster + :pg. Nodes discover each other, join process groups, start working.
Arrow IPC over BEAM distribution. Same bytes in memory and on the wire, with 6 bytes of framing overhead. Backpressure, node monitoring, fault tolerance — all for free.
Lattice-based streaming merge. SUM, COUNT, MIN, MAX — partial results fold as workers complete. Progressive results before the last worker finishes.
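The fold itself can be sketched over plain maps (a hypothetical partial-result shape — the real exchange uses Arrow IPC). Because SUM and COUNT are commutative and associative, partials merge correctly in whatever order workers finish:

```elixir
# Two workers report partial per-group aggregates, possibly out of order.
partials = [
  %{"eu" => %{sum: 120, count: 3}, "us" => %{sum: 80, count: 2}},
  %{"eu" => %{sum: 40, count: 1}}
]

# Merging is a plain fold: overlapping groups combine field-wise.
merged =
  Enum.reduce(partials, %{}, fn partial, acc ->
    Map.merge(acc, partial, fn _group, a, b ->
      %{sum: a.sum + b.sum, count: a.count + b.count}
    end)
  end)

# merged["eu"] == %{sum: 160, count: 4}
```

AVG follows the same shape by carrying {sum, count} and dividing only at the end — which is why progressive results stay exact rather than approximate.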
Vertices + edges. Every graph algorithm reduces to joins and aggregations. PageRank is iterative join → group_by → sum → normalise. Shortest path uses DuckDB's USING KEY recursive CTEs (SIGMOD 2025).
```elixir
graph = Dux.Graph.new(vertices: users, edges: follows)

# Who is most influential?
graph
|> Dux.Graph.pagerank(iterations: 20)
|> Dux.sort_by(desc: :rank)
|> Dux.head(10)

# Shortest path via USING KEY recursive CTE
graph |> Dux.Graph.shortest_paths(start_node)

# Connected components, triangle counting
graph |> Dux.Graph.connected_components()
graph |> Dux.Graph.triangle_count()

# Graph results are %Dux{} — pipe into any verb
graph
|> Dux.Graph.pagerank()
|> Dux.join(departments, on: :id)
|> Dux.group_by(:dept)
|> Dux.summarise(avg_rank: avg(rank))
|> Dux.compute()
```
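To see the reduction concretely, here is one PageRank iteration over plain Elixir terms — the same join → group_by → sum shape Dux runs as SQL. Toy data and a damping factor of 0.85 are assumed here; they are not Dux defaults:

```elixir
edges = [a: :b, a: :c, b: :c, c: :a]
ranks = %{a: 1 / 3, b: 1 / 3, c: 1 / 3}
damping = 0.85
n = map_size(ranks)

# "Join" ranks to edges via out-degree of each source vertex.
out_degree = edges |> Enum.map(&elem(&1, 0)) |> Enum.frequencies()

# Each edge contributes src's rank split across its outgoing edges;
# group the contributions by destination (the GROUP BY step).
contributions =
  edges
  |> Enum.map(fn {src, dst} -> {dst, ranks[src] / out_degree[src]} end)
  |> Enum.group_by(&elem(&1, 0), &elem(&1, 1))

# SUM per group, plus the teleport term.
new_ranks =
  Map.new(ranks, fn {v, _} ->
    {v, (1 - damping) / n + damping * Enum.sum(Map.get(contributions, v, []))}
  end)

# Ranks still sum to 1.0 after the update.
```

Iterating this to a fixed point is exactly what the SQL loop does, one CTE round per iteration.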
Attach databases, reference tables, join across sources. DuckDB pushes filters down to remote databases. No ETL, no staging.
```elixir
Dux.attach(:warehouse, "postgresql://prod/analytics", type: :postgres)

orders = Dux.from_parquet("s3://exports/orders/*.parquet")
customers = Dux.from_attached(:warehouse, "public.customers")

orders
|> Dux.join(customers, on: :customer_id)
|> Dux.filter(order_date > ~D[2026-01-01])
|> Dux.group_by(:region)
|> Dux.summarise(revenue: sum(amount))
|> Dux.compute()
```
DuckDB exposes results as Arrow columnar data. After compute/1, numeric columns become Nx tensors via pointer passing — Dux implements Nx.LazyContainer.
```elixir
tensor = Dux.to_tensor(df, :price)
# #Nx.Tensor<f64[1000] [...]>
```
The _with variants accept raw DuckDB SQL strings. Window functions, CASE WHEN, PIVOT, arbitrary CTEs — anything DuckDB can do.
```elixir
# Window functions
Dux.mutate_with(df, rank: "ROW_NUMBER() OVER (PARTITION BY dept ORDER BY salary DESC)")

# Any DuckDB SQL
Dux.from_query("PIVOT sales ON product USING SUM(amount) GROUP BY region")
```
%Dux{} is a source + ops list — plain Elixir data, freely serialisable. ^ interpolations become $1, $2 bindings.
:pg for discovery. :erpc for fan-out. Supervisor trees for fault tolerance. No JVM, no cluster manager.
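That stack can be sketched in bare OTP — :dux_workers and the map-shaped pipeline below are invented stand-ins for Dux's internals. The pipeline travels as a plain term in the :erpc argument list, so no closures are ever serialised:

```elixir
# Start the default :pg scope and register this process as a worker.
_ = :pg.start_link()
:ok = :pg.join(:dux_workers, self())

# The pipeline is plain data — it crosses node boundaries as a term.
pipeline = %{source: {:parquet, "part-0.parquet"}, ops: [{:filter, "amount > 100"}]}

# Fan out with M/F/A calls: the worker module is already deployed on
# each node, so only data is shipped. Map.fetch!/2 stands in for the
# compile-and-run step here.
results =
  for pid <- :pg.get_members(:dux_workers) do
    :erpc.call(node(pid), Map, :fetch!, [pipeline, :source])
  end
```

On a single (non-distributed) node this degenerates to a local call, which is also why the same code path works unchanged in tests.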
Add kino_dux to your Livebook setup cell. Lazy pipelines render with source provenance, accumulated operations, and generated SQL. Computed results become interactive data tables.
```elixir
Mix.install([
  {:dux, "~> 0.3.0"},
  {:kino_dux, "~> 0.1"}
])
```
```elixir
# Read
Dux.from_csv("data.csv", delimiter: "\t")
Dux.from_parquet("data/**/*.parquet")
Dux.from_ndjson("events.ndjson")
Dux.from_excel("sales.xlsx", sheet: "Q1")
Dux.from_query("SELECT * FROM read_parquet('s3://bucket/data.parquet')")

# Write
Dux.to_csv(df, "output.csv")
Dux.to_parquet(df, "output.parquet", partition_by: [:year, :month])
Dux.to_excel(df, "report.xlsx")
Dux.to_ndjson(df, "output.ndjson")
Dux.insert_into(df, "pg.public.events", create: true)
```