Knowledge Base
Recursive CTEs in NQL
Recursive CTEs in NQL
This guide introduces recursive CTEs (Common Table Expressions) in NQL, showing how you can build graphs, compute hierarchical relationships, or propagate identifiers through relational data. We'll also discuss how to limit recursion to avoid unbounded or cyclic expansions.
1. What Is a Recursive CTE?
A recursive CTE is a query structure that references itself, allowing you to iteratively build a result set. This is helpful for graph or tree-like problems, such as:
- Hierarchical data (e.g., org charts, folder structures).
- Connected components (e.g., identity resolution across multiple identifiers).
- Path traversal (e.g., social network distance, linked data chains).
A recursive CTE typically consists of two parts:
- Anchor Query: A base query that provides an initial set of rows.
- Recursive Query: A query that references the CTE itself, appending or propagating new rows at each iteration.
2. Syntax Overview
Below is a generic example of a recursive CTE in pseudo-NQL:
WITH RECURSIVE cte_name (col1, col2, ...) AS (
-- Anchor query
SELECT initial_values
FROM some_table
WHERE conditions
UNION ALL
-- Recursive query referencing cte_name
SELECT t.col1, t.col2, ...
FROM cte_name
JOIN other_table t
ON cte_name.some_column = t.other_column
WHERE recursion_conditions
AND cte_name.depth_column < 5 -- Example recursion limit
)
SELECT *
FROM cte_name;
Key points:
- WITH RECURSIVE: Declares the CTE as recursive.
- Anchor Query: The initial seed rows.
- Recursive Query: Joins or connects the existing result set to generate new rows in each iteration.
- Recursion Limit: You might include a
depth
or iteration counter to avoid infinite loops.
3. Why Use Recursive CTEs?
Example Use Cases:
- Identity Graph: Combine user identifiers (emails, phone numbers, hashed IDs) into a single "connected component" to unify multiple data points.
- Hierarchical Data: Summarize or explore hierarchical relationships, such as organizational structures or product categories.
- Network Analysis: Build adjacency lists or compute path lengths in social or domain networks.
4. Example: Building an Identity Graph
Below is a simplified example inspired by the use case of grouping identifiers. Suppose you have a dataset with various ID types, and you want to form connected components (i.e., all IDs that belong to the same person).
WITH RECURSIVE identity_graph (node, component, depth) AS (
-- 1. Anchor/base case
SELECT DISTINCT id AS node,
id AS component,
0 AS depth
FROM your_dataset
WHERE id IS NOT NULL
UNION ALL
-- 2. Recursive part
SELECT
new_rel.id_b AS node,
LEAST(g.component, new_rel.id_b) AS component,
g.depth + 1 AS depth
FROM identity_graph g
JOIN relationships new_rel
ON g.node = new_rel.id_a
WHERE g.depth < 5 -- recursion limit
AND new_rel.id_b <> g.component
)
SELECT
component,
ARRAY_AGG(DISTINCT node) AS all_nodes,
COUNT(DISTINCT node) AS node_count
FROM identity_graph
GROUP BY component
ORDER BY node_count DESC;
Step-by-Step:
- Anchor/Base Query:
- We select each
id
as its own "component" and setdepth
to 0.
- We select each
- Recursive Query:
- We join back on a
relationships
table (or subquery) whereid_a
connects toid_b
. - For each
node
in the previous iteration, we find newid_b
values that link to it. - We increment
depth
by 1 to track recursion depth. - We also enforce
g.depth < 5
to prevent infinite loops.
- We join back on a
- Final SELECT:
- We group by
component
, collecting all connectednode
values into an array. - This effectively forms "connected components" or "identity groups."
- We group by
5. Best Practices for Recursive CTEs
5.1 Aliasing and Readability
For clarity and maintainability, always use meaningful aliases for calculated expressions in your recursive queries:
-- Good practice with clear aliases
SELECT
new_rel.id_b AS node,
LEAST(g.component, new_rel.id_b) AS component,
g.depth + 1 AS depth
FROM identity_graph g
-- ...
This makes the query more readable and helps others understand your logic. While it's not strictly required for execution, it's highly recommended, especially for complex expressions.
5.2 Handling Depth and Recursion Limits
When setting recursion limits, the filter is applied to the current depth value before incrementing:
WHERE g.depth < 5 -- This will restrict recursion to max 5 levels
This is equivalent to filtering on g.depth + 1 <= 6
. The query engine evaluates the WHERE clause on the current row's values before generating the next iteration.
5.3 Dataset Restrictions
Important: Recursive CTEs in NQL should only reference your company's datasets directly. External datasets accessed via access rules or marketplace cannot be directly used inside the recursive portion of CTEs.
You must use datasets that your company owns or has directly imported when writing recursive CTEs. This restriction exists because of how the query planning and execution work for recursive operations.
6. Ensuring Good Performance & Stability
- Set a Depth Limit: Recursive queries can loop infinitely if there are cycles. Always consider using a depth column and
WHERE depth < X
. - Use
UNION ALL
vs.UNION
*: In many cases,UNION ALL
is better for performance (unless you specifically need duplicate elimination). - Watch Out for Large Joins: If your relationship table is huge, ensure you index or partition the data to handle iterative joins efficiently.
- Plan for Aggregation: Once the recursion completes, you typically gather results using an aggregation or array function.
7. Common Limitations in NQL
- Dataset Origin Limitation: Only company-owned datasets (your own data) should be used in recursive CTEs. External datasets from the marketplace or accessed via access rules may not work properly in recursive contexts.
- Depth Constraints: Some data planes or execution engines might impose their own recursion limits.
- Performance & Execution Time: Large recursion depths or huge relationship sets can lead to significant processing overhead.
Tip: Always test your recursive CTE logic on smaller subsets before running at scale.
8. Practical Tips
- Max Recursion Parameter: Introduce a
depth
or iteration column to avoid unbounded looping. - De-Duplication: If you want to avoid processing the same node multiple times, consider a "visited nodes" approach or use
distinct
logic in your queries. - Temporary Dataset: Sometimes it's beneficial to write intermediate results to a temporary table or a materialized view if the recursion is especially complex.
- Optimizing Join Order: The order of operations in your recursive part can significantly impact performance. Consider which dataset is smaller and optimize your join conditions accordingly.
Conclusion
Recursive CTEs in NQL open up new possibilities for advanced data modeling—especially around graphs and hierarchical relationships. By structuring your queries with anchor and recursive clauses, plus a safeguard on recursion depth, you can unify many disparate identifiers (e.g., emails, phone numbers, hashed IDs) or climb up and down hierarchical data effectively, all within a single NQL statement.
Leverage these patterns to build identity graphs, uncover connected components, or navigate hierarchical structures—all while ensuring you cap recursion to maintain stable performance.