In this third article in what is rapidly becoming a series on binary trees, we’ll have a look at another way of constructing generic binary trees from a serialised format. For this, we’ll build on some of the techniques and insights from the previous articles:

  1. Reconstructing binary trees from traversals: the initial post which dealt with constructing binary trees from pre-order and in-order sequences
  2. Creating a forest from a single seed followed up by creating different trees conforming to the same pre-order sequence.

In the first article we covered how a generic (unordered) binary tree cannot be constructed from a single depth-first search sequentialisation. Lacking structural information, a second sequentialisation is required, in-order with one or pre or post-order. This is all well and good when the values in the tree are small (measured in bytes), but as they get larger, there is a considerable overhead in this duplication of values. Compression will certainly help, but is unlikely to remove it entirely.

The second article introduced the notion of attachment loci of a tree under construction: the possible locations where the next node/value from a pre-order sequence could be placed. We only needed them for constructing tree permutations at the time, but with a few tweaks, we should be able to use them to provide the structural information to supplement the single sequentialisation.

Relative branch encoding

While the output from the previous function is enough to rebuild the original tree, it doesn’t quite focus on the ease of reconstruction, or the brevity of the communication, which is one of the reasons we started this.

So, before we send this sequentialisation over the wire, there are a number of small tweaks we’ll make:

  1. Change the stream of 2-tuples to a flat stream of values and branch offsets. We trade homogeneous output (2-tuples of branch id and node value) for a reduction of two characters for each tree element. Iteration doesn’t become any harder (shown later) and it allows a minor other optimisation:
  2. Now that we don’t need matched pairs, drop the attachment location of the root node, since there is no actual choice in where to place that;
  3. Change the absolute branch_id to a relative one: 0 for a left child, 1 for a right child and 2 and up to indicate right children after backtracking (and so we still get our numbering to start from the left).
def dfs_relative(root):
    def _relative_traverser(traversal):
        last_branch_id = 0
        for branch_id, value in traversal:
            delta, last_branch_id = 1 + last_branch_id - branch_id, branch_id
            yield delta
            yield value

    return islice(_relative_traverser(dfs_structural(root)), 1, None)

The inner function takes the iterator returned by dfs_structural and turns it into a relative one by tracking the difference between the current and last branch_id, turning the absolute values into just the changes. Instead of returning tuples it returns the delta and node value in sequence.

The final statement reads a bit terse, but it returns the absolute-to-relative conversion of the traversal, skipping the first value from the resulting iterator. When captured into a list, our tree from before is now represented as ["G", 0, "O", 0, "R", 2, "Y", 2, "B", 1, "V"].

Tree reconstruction

Constructing a tree from this stream is pretty straightforward, the only “trick” we need is a pairwise iterator, which is easily achieved by taking a single iterator from the pre-order sequence and zipping it on itself:

def tree_reconstruct(structural_preorder):
    ivalues = iter(structural_preorder)
    root = node = Node(next(ivalues))
    stack = []
    for branch_id, value in zip(ivalues, ivalues):
        if branch_id == 0:
            stack.append(node)
            node.left = node = Node(value)
        else:
            for _ in range(1, branch_id):
                node = stack.pop()
            node.right = node = Node(value)
    return root

Like our previous iterative functions that operate on or create a tree, we maintain a small stack to account for backtracking. If the relative position is 0 we attach it on the left. In other cases we backtrack 0 or more steps (note that range(1, 1) is an empty range) and then attach on the right. Once all values have been processed, the construction is done and the root is returned.

The series of trees below provide a visualisation of the construction process. Each graph shows the tree at the start of a loop iteration, before inserting the next node. Each possible attachment locus is indicated with a number, according to the relative branch numbering scheme.

The tree construction process, made visible with the intermediate trees and the relative branch_id of each attachment locus.

The tree construction process.

Given the sequence ["G", 0, "O", 0, "R", 2, "Y", 2, "B", 1, "V"], this visualises the attachment loci after each ndoe attachment, building up the tree one node at a time.

So, did it work?

A tree with two 2-node legs, one with repeating values "L", the other "R".

A tree with non-distinct elements, created from its uniquely identifying sequence

One of the goals we set out with was to have a minimal serialised size for something like a JSON transport of a tree. Given some smaller trees consisting of 8-character words, the space savings amount to roughly 40%, though this depends greatly on the length of the values. Worst case we achieve parity (minus a handful of bytes).

What about compression though, how much will that affect the results? A little bit but not a whole lot. Serialising a 1000-node, mostly balanced, tree, where each node contains an 8-character word as its value, the following results are achieved:

  • JSON with pre-order and in-order sequences
    • plaintext: 22.026 bytes
    • gzipped: 7.638 bytes (2.9x compression ratio)
  • JSON with relative branch-encoded sequence
    • plaintext: 13.012 bytes (41% smaller)
    • gzipped: 5.663 bytes (2.3x compression ratio, still 25% smaller)

Another significant benefit over the dual-sequence method of construction is that with this method, the tree is not restricted to distinct values. Because the structural information is encoded in the sequence and provides complete information for node placement, repeated values do not create ambiguity in the reconstructed tree. The graph on the right shows the tree constructed from ["P", 0, "L", 0, "L", 3, "R", 1, "R"]. Attempting to construct this tree from in-order and pre-order sequences would have four equally possible outcomes without a way to determine the correct one. The algorithm from the first post would place the second L on the right of the first, for example.


Comments

comments powered by Disqus