Knowledge Base

Recursive CTEs in NQL

This guide introduces recursive CTEs (Common Table Expressions) in NQL, showing how you can build graphs, compute hierarchical relationships, or propagate identifiers through relational data. We'll also discuss how to limit recursion to avoid unbounded or cyclic expansions.

1. What Is a Recursive CTE?

A recursive CTE is a query structure that references itself, allowing you to iteratively build a result set. This is helpful for graph or tree-like problems, such as:

Hierarchical data (e.g., org charts, folder structures).
Connected components (e.g., identity resolution across multiple identifiers).
Path traversal (e.g., social network distance, linked data chains).

A recursive CTE typically consists of two parts:

Anchor Query: A base query that provides an initial set of rows.
Recursive Query: A query that references the CTE itself, appending or propagating new rows at each iteration.

2. Syntax Overview

Below is a generic example of a recursive CTE in pseudo-NQL:

WITH RECURSIVE cte_name (col1, col2, ...) AS (
    -- Anchor query
    SELECT initial_values
    FROM some_table
    WHERE conditions

    UNION ALL

    -- Recursive query referencing cte_name
    SELECT t.col1, t.col2, ...
    FROM cte_name
    JOIN other_table t
    ON cte_name.some_column = t.other_column
    WHERE recursion_conditions
      AND cte_name.depth_column < 5  -- Example recursion limit
)
SELECT *
FROM cte_name;

Key points:

WITH RECURSIVE: Declares the CTE as recursive.
Anchor Query: The initial seed rows.
Recursive Query: Joins or connects the existing result set to generate new rows in each iteration.
Recursion Limit: You might include a depth or iteration counter to avoid infinite loops.

3. Why Use Recursive CTEs?

Example Use Cases:

Identity Graph: Combine user identifiers (emails, phone numbers, hashed IDs) into a single "connected component" to unify multiple data points.
Hierarchical Data: Summarize or explore hierarchical relationships, such as organizational structures or product categories.
Network Analysis: Build adjacency lists or compute path lengths in social or domain networks.

4. Example: Building an Identity Graph

Below is a simplified example inspired by the use case of grouping identifiers. Suppose you have a dataset with various ID types, and you want to form connected components (i.e., all IDs that belong to the same person).

WITH RECURSIVE identity_graph (node, component, depth) AS (
    -- 1. Anchor/base case
    SELECT DISTINCT id AS node,
                    id AS component,
                    0 AS depth
    FROM your_dataset
    WHERE id IS NOT NULL

    UNION ALL

    -- 2. Recursive part
   SELECT 
    new_rel.id_b AS node,
    LEAST(g.component, new_rel.id_b) AS component,
    g.depth + 1 AS depth
    FROM identity_graph g
    JOIN relationships new_rel
        ON g.node = new_rel.id_a
    WHERE g.depth < 5           -- recursion limit
      AND new_rel.id_b <> g.component
)
SELECT
    component,
    ARRAY_AGG(DISTINCT node) AS all_nodes,
    COUNT(DISTINCT node) AS node_count
FROM identity_graph
GROUP BY component
ORDER BY node_count DESC;

Step-by-Step:

Anchor/Base Query:
- We select each id as its own "component" and set depth to 0.
Recursive Query:
- We join back on a relationships table (or subquery) where id_a connects to id_b.
- For each node in the previous iteration, we find new id_b values that link to it.
- We increment depth by 1 to track recursion depth.
- We also enforce g.depth < 5 to prevent infinite loops.
Final SELECT:
- We group by component, collecting all connected node values into an array.
- This effectively forms "connected components" or "identity groups."

5. Best Practices for Recursive CTEs

5.1 Aliasing and Readability

For clarity and maintainability, always use meaningful aliases for calculated expressions in your recursive queries:

-- Good practice with clear aliases
SELECT 
    new_rel.id_b AS node,
    LEAST(g.component, new_rel.id_b) AS component,
    g.depth + 1 AS depth
FROM identity_graph g
-- ...

This makes the query more readable and helps others understand your logic. While it's not strictly required for execution, it's highly recommended, especially for complex expressions.

5.2 Handling Depth and Recursion Limits

When setting recursion limits, the filter is applied to the current depth value before incrementing:

WHERE g.depth < 5  -- This will restrict recursion to max 5 levels

This is equivalent to filtering on g.depth + 1 <= 6. The query engine evaluates the WHERE clause on the current row's values before generating the next iteration.

5.3 Dataset Restrictions

Important: Recursive CTEs in NQL should only reference your company's datasets directly. External datasets accessed via access rules or marketplace cannot be directly used inside the recursive portion of CTEs.

You must use datasets that your company owns or has directly imported when writing recursive CTEs. This restriction exists because of how the query planning and execution work for recursive operations.

6. Ensuring Good Performance & Stability

Set a Depth Limit: Recursive queries can loop infinitely if there are cycles. Always consider using a depth column and WHERE depth < X.
Use UNION ALL vs. UNION*: In many cases, UNION ALL is better for performance (unless you specifically need duplicate elimination).
Watch Out for Large Joins: If your relationship table is huge, ensure you index or partition the data to handle iterative joins efficiently.
Plan for Aggregation: Once the recursion completes, you typically gather results using an aggregation or array function.

7. Common Limitations in NQL

Dataset Origin Limitation: Only company-owned datasets (your own data) should be used in recursive CTEs. External datasets from the marketplace or accessed via access rules may not work properly in recursive contexts.
Depth Constraints: Some data planes or execution engines might impose their own recursion limits.
Performance & Execution Time: Large recursion depths or huge relationship sets can lead to significant processing overhead.

Tip: Always test your recursive CTE logic on smaller subsets before running at scale.

8. Practical Tips

Max Recursion Parameter: Introduce a depth or iteration column to avoid unbounded looping.
De-Duplication: If you want to avoid processing the same node multiple times, consider a "visited nodes" approach or use distinct logic in your queries.
Temporary Dataset: Sometimes it's beneficial to write intermediate results to a temporary table or a materialized view if the recursion is especially complex.
Optimizing Join Order: The order of operations in your recursive part can significantly impact performance. Consider which dataset is smaller and optimize your join conditions accordingly.

Conclusion

Recursive CTEs in NQL open up new possibilities for advanced data modeling—especially around graphs and hierarchical relationships. By structuring your queries with anchor and recursive clauses, plus a safeguard on recursion depth, you can unify many disparate identifiers (e.g., emails, phone numbers, hashed IDs) or climb up and down hierarchical data effectively, all within a single NQL statement.

Leverage these patterns to build identity graphs, uncover connected components, or navigate hierarchical structures—all while ensuring you cap recursion to maintain stable performance.