Tekko

Language

Get in Touch

Usually respond within 24 hours

Back to BlogArchitecture

Querying Multi-GB Parquet Files in the Browser with DuckDB-Wasm

7 min read
DuckDBWebAssemblyApache ArrowData EngineeringFrontend
Querying Multi-GB Parquet Files in the Browser with DuckDB-Wasm

The traditional architecture for web-based data analytics follows a predictable pattern: a heavy-duty backend (Snowflake, BigQuery, or ClickHouse) processes queries, and a thin client renders the results. While this works for many use cases, it introduces significant latency, high egress costs, and a constant reliance on network availability.

Recent advancements in WebAssembly (Wasm) and columnar data formats have shifted the needle. We are now at a point where we can process multi-gigabyte datasets directly in the user's browser with sub-second response times. By combining DuckDB-Wasm, Apache Arrow, and Parquet, we can build "serverless" analytics tools that are faster, cheaper, and more private than their cloud-reliant predecessors.

The Technical Trio: Why This Stack Works

To understand why this approach is revolutionary, we need to look at the three core technologies involved.

1. DuckDB-Wasm: The Analytical Engine

DuckDB is an in-process SQL OLAP (Online Analytical Processing) database management system. Unlike SQLite, which is optimized for transactional workloads (row-based), DuckDB is optimized for analytical workloads (columnar). DuckDB-Wasm brings this power to the browser. It isn't a mock or a subset; it is the full DuckDB engine compiled to WebAssembly, capable of executing complex SQL queries, joins, and aggregations directly in a browser thread or Web Worker.

2. Apache Arrow: The Memory Standard

Apache Arrow provides a cross-language development platform for in-memory data. It specifies a standardized, column-oriented memory format. In our stack, Arrow acts as the glue. DuckDB processes data in Arrow format, and modern visualization libraries (like Arquero or even raw D3) can consume Arrow buffers without the overhead of serializing and deserializing large JSON objects. This "zero-copy" philosophy is what makes the performance feel native.

3. Parquet: The Storage Efficiency

Parquet is the industry-standard columnar storage format. It is highly compressed and supports "predicate pushdown" and "projection pushdown." This means that if you have a 5GB Parquet file but your SQL query only asks for the total_sales column for the year 2023, the system doesn't need to download the whole 5GB. It only fetches the specific byte ranges required for that column and those rows.

The Secret Sauce: HTTP Range Requests

The most common question I get when proposing this architecture is: "Do we really want to make a user download a 2GB file just to see a chart?"

The answer is no, and you don't have to. DuckDB-Wasm utilizes HTTP Range Requests (standardized in HTTP/1.1). When DuckDB opens a remote Parquet file, it first fetches the metadata (the footer of the Parquet file). This metadata tells DuckDB exactly where each column and row group is located in the file.

When you execute a query, DuckDB calculates the specific byte offsets needed and issues targeted GET requests with a Range: bytes=start-end header. In a typical interactive dashboard, a user might only trigger 5-10MB of data transfer to query a multi-gigabyte dataset. This makes the "large file" problem a non-issue for most analytical scenarios.

Implementing the Architecture

Let's look at how to set up a basic environment to query a remote Parquet file.

Initializing DuckDB-Wasm

First, you need to instantiate the database. It is highly recommended to run DuckDB in a Web Worker to keep the main UI thread responsive.

import * as duckdb from '@duckdb/duckdb-wasm'; const JSDELIVR_BUNDLES = duckdb.getJsDelivrBundles(); const bundle = await duckdb.selectBundle(JSDELIVR_BUNDLES); const worker = new Worker(bundle.mainWorker); const logger = new duckdb.ConsoleLogger(); const db = new duckdb.AsyncDuckDB(logger, worker); await db.instantiate(bundle.mainModule, bundle.pthreadWorker);

Querying a Remote Parquet File

Once the database is initialized, you can use the db.connect() method to create a session and register a remote file. DuckDB treats the URL as a table.

const conn = await db.connect(); // Querying a 2GB file directly from an S3 bucket or CDN const parquetUrl = 'https://my-data-bucket.s3.amazonaws.com/large_dataset.parquet'; const result = await conn.query(` SELECT category, SUM(price) as total_revenue FROM '${parquetUrl}' WHERE date > '2023-01-01' GROUP BY category ORDER BY total_revenue DESC `); // The result is an Apache Arrow Table console.log(result.toArray()); await conn.close();

Performance Optimization Strategies

While DuckDB-Wasm is incredibly fast, handling multi-gigabyte datasets in a resource-constrained environment like a browser requires intentionality.

1. Optimize Your Parquet Files

Not all Parquet files are created equal. For browser-side querying, you should:

  • Sort your data: If you frequently filter by a specific column (e.g., timestamp), sort the Parquet file by that column. This allows DuckDB to skip entire row groups using the metadata.
  • Use smaller row groups: Default row groups in Spark or Dask might be too large. Aim for row groups that result in 10-50MB of uncompressed data to allow for more granular HTTP range requests.
  • Choose the right compression: Snappy or Zstd are generally preferred for a balance between CPU usage and file size.

2. Manage Memory Constraints

Browsers impose memory limits on WebAssembly (usually 2GB to 4GB depending on the browser and OS). If you try to load a massive dataset entirely into memory, the tab will crash.

Always leverage the read_parquet('url') function directly in your SQL rather than trying to INSERT the data into an in-memory DuckDB table unless the dataset is small. When querying a URL, DuckDB uses its buffer manager to stream data, keeping the memory footprint low.

3. Connection Pooling

Creating a new connection for every query adds overhead. Maintain a persistent connection for the duration of the user session. However, remember that DuckDB-Wasm is currently single-writer. If you have multiple workers trying to write to the same in-memory database, you'll need to coordinate them.

Real-World Use Cases

Internal Data Exploration Tools

At many companies, data scientists share results via static CSVs or massive JSON dumps. Building a simple React wrapper around DuckDB-Wasm allows teammates to drop a Parquet file into the browser and perform ad-hoc SQL analysis without setting up a Python environment or uploading sensitive data to a third-party SaaS.

Log Analysis Dashboards

CloudWatch or ELK can be expensive. If you export your logs to S3 in Parquet format, you can build a client-side dashboard that queries those logs directly. This bypasses the need for an expensive indexing cluster while providing the same (or better) query performance for historical analysis.

Edge Analytics for SaaS

If you provide analytics to your customers, you can reduce your server costs significantly. Instead of running expensive aggregation queries on your backend for every user interaction, you can serve the raw (or pre-aggregated) Parquet files via a CDN and let the user's CPU do the work. This results in a snappier UI and a lower AWS bill.

The Trade-offs

No architecture is a silver bullet. Here are the limitations you must consider:

  • Cold Start: The initial Wasm binary download (several megabytes) can be a hurdle for one-off page visits. Use service workers to cache the binaries.
  • CORS: Since you are making range requests to a data source (like S3), you must configure Cross-Origin Resource Sharing (CORS) headers correctly on your storage bucket to allow Range headers.
  • Client Capabilities: A user on a 5-year-old budget smartphone will struggle with complex joins on millions of rows. This stack is best suited for desktop environments or modern mobile devices.

Conclusion: Your Action Plan

Implementing client-side analytics with DuckDB-Wasm and Apache Arrow isn't just about showing off technical prowess; it's about fundamentally changing the cost and performance dynamics of data applications.

To get started:

  1. Convert your data: Use a tool like Pandas or the DuckDB CLI to convert your existing CSV/JSON datasets into optimized, sorted Parquet files.
  2. Host on a CDN: Place these files on a service that supports HTTP Range Requests (S3, R2, or even a well-configured Nginx server).
  3. Prototype Small: Start by replacing a single heavy API endpoint with a client-side DuckDB query. Measure the latency and egress savings.

By moving the compute to the data—or in this case, moving the compute to the user—you unlock a level of interactivity that traditional architectures simply cannot match.