Common Table Expression Recursion
Blog Tuesday, February 01 2011I like using recursion when it helps to make an algorithm clearer. That's not always the case though. For many developers, recursion is difficult to understand. So I'm always on the hunt for code samples that make recursion easier to understand. One good example of this can be found in traversing the foriegn keys in a database using a SQL Server Common Table Expression (CTE).
When I load data into a database, I'd like to know the order I must use to load the tables. Some tables will be at Level Zero (0). Those are the ones that have no dependencies on other tables. They're typically type tables that contain reference data and they can be loaded at any time because they have no foreign keys in them. The next level, Level One (1), are those tables that depend on the Level Zero (0) tables. And so on, and so on. If I knew the load order, my Extract, Transform and Load (ETL) packages could be scripted to run level by level starting at Level Zero (0). A Common Table Expression (CTE) can be used to find that load order using recursion. Let's look at the code first, then I'll describe it.
WITH
[key_info] AS
(
SELECT
OBJECT_SCHEMA_NAME([sfk].[parent_object_id]) AS [from_schema]
, OBJECT_NAME([sfk].[parent_object_id]) AS [from_table]
, OBJECT_SCHEMA_NAME([sfk].[referenced_object_id]) AS [to_schema]
, OBJECT_NAME([sfk].[referenced_object_id]) AS [to_table]
FROM [sys].[foreign_keys] AS [sfk]
WHERE [sfk].[parent_object_id] <> [sfk].[referenced_object_id]
)
, [level_info] AS
(
-- the anchor part
SELECT
[ss].[name] AS [SchemaName]
, [st].[name] AS [TableName]
, 0 AS [ReferenceLevel]
FROM [sys].[tables] AS [st]
INNER JOIN [sys].[schemas] AS [ss]
ON [st].[schema_id] = [ss].[schema_id]
LEFT OUTER JOIN [key_info] AS [ki]
ON [ss].[name] = [ki].[from_schema]
AND [st].[name] = [ki].[from_table]
WHERE [ki].[from_schema] IS NULL
UNION ALL
-- the recursive part
SELECT
[ki].[from_schema]
, [ki].[from_table]
, [li].[ReferenceLevel] + 1
FROM [key_info] AS [ki]
INNER JOIN [level_info] AS [li]
ON [ki].[to_schema] = [li].[SchemaName]
AND [ki].[to_table] = [li].[TableName]
)
SELECT
[li].[SchemaName]
, [li].[TableName]
, MAX([li].[ReferenceLevel]) AS [LoadLevel]
FROM [level_info] AS [li]
GROUP BY [li].[SchemaName], [li].[TableName]
ORDER BY MAX([li].[ReferenceLevel]), [li].[TableName];
The WITH statement kicks off the CTE by establishing two data elements that will be used later on. The first one is called [key_info] and it's just a dump of the foreign keys in the database, capturing parent (from) tables to referenced (to) tables. Care must be taken not to include those keys that refer back to the same table. We don't need them for discovering the load levels and they will cause the recursion to become infinite. More on that in a moment.
The [key_info] part of the WITH statement isn't recursive but the next part named [level_info] is. You can tell it's a recursive CTE by finding the UNION ALL statement in the middle of the [level_info] definition. The part of the query before the UNION ALL is called the anchor part. The part following it is the recursive part. The anchor part of [level_info] builds a list of schemas and tables then LEFT OUTER JOINs them to the [key_info] using the parent (from) side of the keys. The trick to understanding this is in realizing (A) that it's a LEFT OUTER JOIN to the key data and that (B) it uses a WHERE clause that includes only those tables with NULL keys on the parent (from) side of the LEFT OUTER JOIN. In other words, I'm including only those tables that have no foreign key references. Also notice how the [ReferenceLevel] is hard-coded to zero (0) in this anchor part? These are our Level Zero (0) tables: the ones that can be loaded at any time because they don't depend on anything else.
After the anchor part establishes the Level Zero (0) tables, the recursive part below the UNION ALL finishes the job. SQL will execute the recursive part over and over again until no new data is unioned in to the results. That's how it knows when to stop so it's important to make sure that you write the recursive part in such a way that it will actually finish. Had we included self-referencing (same table) keys in the [key_info], we'd have had a real problem getting the recursive part to finish.
The recursive part of this query will finish because the referenced (to) values in the [key_info] are joined with the schema and table combinations in [level_info] from the last recursion pass (or the anchor results if it's the first call to the recursive part, of course). Joining on the referenced (to) side of the key is important because we want to find all the tables already in the [level_info] list that have references "to" them. Then, in the [level_info] data in this pass, we capture the parent (from) data on the matches and increment the [ReferenceLevel] to one greater than the last known value. This could result in the same schema/table combination being included in the recursive results more than once, potentially at different levels. But that's OK. We take care of the duplicates in the last part of the query.
Following the CTE that does all the recursive work is the consumption part of the query. It can be a SELECT statement or an INSERT that selects from the CTE as its source. In this case, we simply select out the data, grouping by the schema and table name and picking off the MAX([ReferenceLevel]). It's important to pick the maximum reference level that was recorded for any table because it may refererence tables at varying load levels. As an example, imagine a table named C that references another table A directly using a foreign key. Table C may also reference a table named B that in turn references table A. In our ungrouped [level_info] results, Table A would appear twice: once as a Level One (1) table and again as a Level Two (2) table. We need to select the Level Two (2) row to make sure that we don't try to load table C before table B is fully loaded; hence the use of the MAX function in the final SELECT statement.
Running this query on the AdventureWorks database, for example, shows that there are eight (8) layers of data to be loaded into this database. In Level Zero (0), we find basic lookup tables like [Person].[AddressType] and [Sales].[SalesReason]. Tables like that don't depend on anything else so they can be updated first. At the other end of the spectrum are the tables at Level Seven (7). These are highly dependent tables like [Sales].[SalesOrderDetails] and [Sales].[SalesOrderHeaderSalesReason]. And, of course, there are dozens of tables in the six (6) layers in between. Try this query on your own databases to see what it reveals about the organization of your data. Missing foreign keys might become more apparent to you by examining the results.
Hopefully, this query helps you to wrap your brain around recursion a bit better, too. Enjoy.

