DAG vs. Tree
I am wondering whether the data structure “tree” might be a giant mistake in the history of computer science. Almost everything that can be saved in a tree is really a DAG.
Some background: A tree is a data structure consisting of nodes, where each node has one or no parent (this is technically a forest; a tree has only one node with no parent). Each node has children; specifically the nodes that have that node as its parent. Circles are forbidden.
The best-known example for this are computer file systems. Each file is in a folder. That folder is in another folder and so on, until you reach the drive, which forms the root of the tree. Mail lists and some kinds of forums and comment systems also work like that: Each post is a reply to exactly one previous post (at most) and gets shown a layer below. However, tree structures predate computer science. The whole field of Taxonomy as well as linguistics, both with their habit of grouping things in families and sub-families and so on, do exactly the same thing.
It’s easy to process trees automatically, for example to find all children. You can also identify each child uniquely through a given path.
There is just one problem with them: Reality does not work like that. In most real-world cases things, can have multiple parents. A classic example: What folder does a file fit in? Likewise, in discussions, sometimes you want to reply to several posts that make similar points at once. In linguistics, languages don’t really belong to one family exclusively, they have influence from lost of different ones.
To get around that, one either has to pick a parent explicitly, or split the object. Splitting works for discussions well, for files sometimes, for e.g. classification of car types not at all. In either case, information about the semantic relationships is lost.
The answer to that is called DAG, for Directed Acyclic Graph. The rules are almost the same as a tree, only a node can have more than one parent as well. An example: Media library apps like iTunes allow a song to appear in several playlists at once.
DAGs represent the examples here naturally. A file can be in two folders, for example. Since a tree is a special case of a DAG, it’s also easily possible to represent the cases that actually are tree-like.
In reality many systems already include hacks to create DAG structures. For example, thanks to Hard links, most file systems actually are DAGs, they just don’t show it that often. Symlinks, Aliases and Windows shortcut files all create a “second-class parent”, which explicitly stores the removed relationships.
Another approach is through tag systems, where a file, blog post or similar can have an arbitrary number of tags. This structure is actually just a two-level DAG. That may seem like a solution, but since it isn’t possible to form relationships between tags, and because many use cases of trees can’t be modeled that way (for example discussions), it isn’t a full replacement.
Admittedly, from a technical point of view, DAGs are much harder. For example: A tree is always planar, which means you can easily show it on a screen without crossing lines, something that is not guaranteed for DAGs. Enumerating all elements without repeating any is a lot harder. Objects can be reachable through two completely distinct paths - which is of course the goal, but it means you have to look at the graph to answer the question “do these two paths end at the same object”, which is trivial with trees.
Most importantly, I know no really good UI to show and manipulate DAGs. For example, the opposite to a tree structure in forums is not a DAG, it’s a simple list ordered by time, and the relationships between posts have to be shown manually in the text.
(Of course, this does not affect the use of trees in cases where end users don’t see them, for example search trees, heaps and so on.)
What do you think? Due to too much spam, I’ve disabled the comments here for now, but you can reach me on Twitter via @zcochrane.
Written on March 11th, 2014 at 12:33 pm