
Chunking in NQL

Overview

Chunking is an execution strategy that lets NQL process large datasets in smaller, time-bounded segments so queries run more reliably and cost‑effectively. It is transparent to users: you write normal NQL, and the platform decides when and how to split the work.

  • Designed for stability on very large datasets (tens of GB to 100+ TB).
  • Reduces the risk of long-running cluster failures and wasted spend.
  • Preserves the query’s intent and output semantics while improving operability.

What problem does chunking solve?

  • Stability: Very large, single-shot scans can fail due to resource pressure. Chunking caps the work per step.
  • Cost control: Failures late in a long job waste compute. Chunking limits the blast radius and simplifies retries.
  • Scalability: Breaking work into pieces lets the platform scale more predictably across diverse dataset sizes.

How it works (high level)

  • NQL evaluates whether a query is eligible for chunked execution.
  • If eligible, the platform splits scanning work into ranges based on a last-modified timestamp (e.g., nio_last_modified_at / last_modified_at).
  • Each chunk runs independently, and results are combined to produce the final output as if the query ran once (see the sketch after this list).
  • There are no user-facing knobs or new syntax—chunking is automatic when it helps.
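
Conceptually, a chunked run behaves as if the platform rewrote one large scan into a series of timestamp-bounded sub-queries whose outputs are then combined. The sketch below is illustrative only: the chunk boundaries are hypothetical examples, and the platform performs this split internally with no user action.

-- Illustrative sketch only; the platform does this internally.
-- The boundary values below are hypothetical examples.
SELECT
  user_id,
  email."value" AS email_sha256
FROM company_data."large_source_dataset"
WHERE nio_last_modified_at >= TIMESTAMP '2024-01-01 00:00:00'
  AND nio_last_modified_at <  TIMESTAMP '2024-02-01 00:00:00';

-- ...later chunks cover subsequent ranges, for example:
--   WHERE nio_last_modified_at >= TIMESTAMP '2024-02-01 00:00:00'
--     AND nio_last_modified_at <  TIMESTAMP '2024-03-01 00:00:00';
-- Combining the chunk outputs is equivalent to one full scan.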

Eligibility (in scope)

Chunking applies to queries that primarily scan a single dataset (Narrative or company data) with standard projections and filters.

  • Supported data sources include company_data datasets and narrative.rosetta_stone.
  • Typical shapes: a scan with filters and projections (see the sketch below).
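
For illustration, a query like the following matches the eligible shape: a single-dataset scan with simple projections and filters. The dataset and column names here are hypothetical:

-- Eligible shape: one dataset, simple projections and filters.
-- "example_events" and its columns are hypothetical names.
SELECT
  user_id,
  event_type,
  occurred_at
FROM company_data."example_events"
WHERE occurred_at >= CURRENT_TIMESTAMP - INTERVAL '30' DAY;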

Out of scope (for now)

To preserve correctness and keep the mental model simple, these shapes are not chunked:

  • Complex joins (multi-table joins, especially when join semantics span large time ranges)
  • Unions and other set operations
  • Global sorts, aggregates, and windowing that depend on complete data in a single pass
  • Table functions that change scan semantics

Note: Queries that include these operations still run; they just may not use chunking. The sketch below shows one such shape.
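
By contrast, a hypothetical join like the one below still runs normally but is not split into chunks, since producing correct join results may require seeing each input in full:

-- Not chunked: a multi-table join; correctness may require complete
-- inputs in a single pass. Dataset and column names are hypothetical.
SELECT
  e.user_id,
  s.segment_name
FROM company_data."example_events" AS e
JOIN company_data."example_segments" AS s
  ON e.user_id = s.user_id;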

Example

This is a standard NQL materialized view. If the platform determines it is eligible, it may run using chunking automatically—no changes needed.

CREATE MATERIALIZED VIEW "large_user_enrichment"
  REFRESH_SCHEDULE = '@weekly'
  WRITE_MODE = 'overwrite'
AS
SELECT
  user_id,
  email."value" AS email_sha256,
  country_code,
  last_seen_at
FROM company_data."large_source_dataset"
WHERE last_seen_at >= CURRENT_TIMESTAMP - INTERVAL '180' DAY;

  • The query shape (scan + filters + projections) makes it a good candidate for chunking.
  • The platform may split work by last-modified ranges to improve reliability on very large inputs.

Fairness and expected results

  • The goal is to produce results comparable to a single-pass run, while distributing work over time.
  • Minor differences can occur at chunk boundaries due to processing order or concurrent updates, but results should be materially the same for typical use cases.
