Managing deep hierarchies with RDBMS
One time or the other you might already have or you will come across a scenario where you have to store deep hierarchies in a relational database.
Some such scenarios are product categories, threaded comments, folder structures which might dynamically increase or decrease.
Products -> Electronics -> Phones And Tablets -> Brands-> Moodels
These hierarchies may change over time, the depth may increase or decrease.
For eg, the products hierarchy may change over time as new categories are introduced.
The above example shows a threaded comment from hacker news. You will notice the hierarchy which keeps on increasing endlessly.
How will you store all these comments in database and again fetch them in the same manner?
How do we manage such deep dynamic hierarchies in a relational database?
There are a few ways to go about it, we will discuss the most popular solutions out there:
- Adjacency List
- Nested Sets
- Materialised Path
- Closure Tables
Adjacency List (parent-child relationship)
In this method a simple parent child relationship is used to represent hierarchy.
You will notice every node has a parent id. This helps us to maintain the hierarchy.
Adding a new record is fairly simple. You just need to know the parent id to insert a new record.
- Inserting new records is easy.
- Simple to implement. Only parent id is needed.
- Getting all children of a node is very expensive, as it would require forming the entire tree/sub-tree. At every node the query would be fired recursively.
When to use?
- When you are writing more than you read.
- This is one of the simplest ways to represent hierarchy in relational databases.
- If you are just getting started and don’t have clear idea about the use case, this is the go to approach which fits all requirements.
This is quite an interesting technique to represent hierarchies. This is because this approach considers nodes as sets instead of tree.
Each node has two indices left and right from the root. Indexes are assigned starting from the root node, moving to the left and then right though each node.
Essentially, all the nodes on the left of root node in question will have greater indices than the left index of the root node and smaller than the right index of the root node.
From the above figure it is easy to understand why this approach is called as nested sets. Each node is a set holding indexes of the child nodes.
- Getting all children of a node is very easy.
- Selecting child sets is very easy and can be done using the left and right indices.
- Insertion requires re-indexing the entire tree.
- Deletion requires re-indexing of the entire tree.
- Poor choice when more complex relational data exists for the elements in the tree
When to use?
- When node creation and deletion is less than reads.
- Reporting/complete tree traversals are frequent.
- It’s best used when you need to query a tree more frequently than you need to modify the tree
Comments section — Comments section are less likely to change once created. Sites like hacker news and stack overflow don’t allow user’s to delete their comments after a certain time period. In such a use case nested sets are a good option.
In this approach the idea is to use something called as a lineage column.
So what is this?
Lineage column is simply a path column.
Let’s say we represent path to our root node as “A”.
Then subsequent children will have paths like “AA”, “AB”, “AC”.
Note: In the above example the step length is 1. It can be increased to increase the number of immediate children.
- Insertion and deletion is very easy.
- Traversing the tree is very easy.
- No of immediate children is limited. If we are using alphabets it will be 26 with step length as 1.
- Maximum depth is limited.
When to use?
- Multiple roots might exist.
- Sibling ordering is required.
Good for storing hierarchy where depth is limited like movies genres might increase but within each of them sub-categories won’t increase that often, hence the depth will always be limited.
Closure tables are tables in which you store all paths from one node to all others along with the depth.
Let’s see what a closure table for this case would look like.
Now, it is very easy to query children/parents at any level.
- Insertion and deletion is moderate.
- Querying child/parent at each level is efficient.
- Insertion at deeper levels is costly as a record for each level till the current level will need to be recorded.
- Extra space for storing all paths.
When to use?
- Extra space is not a big deal.
- Tree depth is relatively less.
Comments section — because comments require information of immediate ancestors and descendants.
These are 4 major ways to store hierarchical data in relational databases. Many other approaches like flat tables, adjacency list with CTEs, multiple lineage columns do exists.
There are no correct answers as to which approach is the best it completely depends on the use case/scenario you are dealing with. Factors like number of read, writes and resulting representation affects performance to a huge extent.
In general it is helps to know all approaches that can be used and then make a decision based on your requirements.
Now that you know how to handle hierarchical data with relational databases, see which one works best for you.
Hope you learned something interesting :)