Constructing binary trees using relative branch encoding

In this third article in what is rapidly becoming a series on binary trees, we’ll have a look at another way of constructing generic binary trees from a serialised format. For this, we’ll build on some of the techniques and insights from the previous articles:

  1. Reconstructing binary trees from traversals: the initial post which dealt with constructing binary trees from pre-order and in-order sequences
  2. Creating a forest from a single seed followed up by creating different trees conforming to the same pre-order sequence.

In the first article we covered how a generic (unordered) binary tree cannot be constructed from a single depth-first search sequentialisation. Lacking structural information, a second sequentialisation is required, in-order with one or pre or post-order. This is all well and good when the values in the tree are small (measured in bytes), but as they get larger, there is a considerable overhead in this duplication of values. Compression will certainly help, but is unlikely to remove it entirely.

The second article introduced the notion of attachment loci of a tree under construction: the possible locations where the next node/value from a pre-order sequence could be placed. We only needed them for constructing tree permutations at the time, but with a few tweaks, we should be able to use them to provide the structural information to supplement the single sequentialisation.

more ...

Creating a forest from a single seed

In the previous post we explored recreating a binary tree from a pair of sequentialisations. We needed a pair of them because any single sequence by itself doesn’t uniquely describe the tree, without some additional bit of information, the sequence itself leaves a certain level of ambiguity.

But exactly how ambiguous is a single traversal result? How many different trees can we make that fit a given sequence in isolation? What sort of structure is there in them? Fun questions we can answer with code!

Seeing the trees

At this point, before we attempt to create our own forest of binary trees, it’s a good time to look into visualising the trees we plan on making. At its core, a binary tree is a specific type of graph, and there are a ton of tools out there to visualise graphs. One of the more popular open source solutions is the excellent GraphViz. There are various Python packges that provide an interface for it, with pros and cons to all of them, a review of which is well outside the scope of this post. So in short, we’ll be using PyDot, which creates graphs in GraphViz’ Dot format, which we can then have rendered to various image formats.

more ...

Reconstructing binary trees from traversals

Between work, play and other side projects, I’ve been playing around with binary trees. They’re a hugely useful data structure in many fields, but more relevantly in this case, they lend themselves to all sorts of noodling and tinkering. Among the many possible questions is the seemingly simple “how do you serialise a tree?” for inter-process communication, a web-based API or just for the hell of it. And given such a serialisation, how do you reconstruct the original tree from it?

One way is to express each node as a list of three elements: the value of the node, its left child and its right child. Each of these children is its own three-element list (or some form of empty value) and on and on it goes. The end of this serialisation will have a particular signature: ]]]]]]]. Something we have seen before.

I have nothing against brackets per se, but as some have said: “flat is better than nested.” [1] And there are wonderfully flat ways of describing binary trees. This article will cover one such way of representing as, and then constructing a binary tree from a pair of lists, without the need for additional structural information.

more ...

Setting eager defaults for SQLAlchemy ORM models

Default values. We tend to not really think about them. They casually get applied at just the right time when we persist our objects, and everything is right with the universe. Except maybe, sometimes, we need that default value to be there before we flush to the database. What if we want the default earlier?

All defaults, all the time

Let’s start with something basic, where we try to eagerly apply as many defaults as we can during construction. SQLAlchemy allows for a whole host of different defaults, but briefly summarized, these are broadly what are accepted:

  1. Constant values. Strings, booleans, containers, or any value object
  2. SQL expressions. That are executed during flush (e.g. sqlalchemy.func.now())
  3. Python callables. These can be of two kinds: simple argument-less functions, or ones that are context sensitive, meaning they accept an execution context, which allows access to other columns’ values and various other bits.

During object creation, we don’t actually interact with the database, so SQL expressions are meaningless, and because Python functions will expect a context, it’s easier to just ignore all of them. Constant values it is!

So how do we go about this? Overriding the __init__ method is the obvious first candidate. Unfortunately, that doesn’t work due to the internals of the ORM machinery. Thankfully the SQLAlchemy developers have thought of us and there’s the option to provide an alternative constructor during the creation of the ORM Base class. Using this, let’s define a Base, our User model and a basic alternative constructor:

more ...

Merging sorted lists

Recently I was asked to sketch out a function that combines two sorted lists into a single (sorted) list. It’s a fun question with a couple of different answers, potentially all of them correct but with different pros and cons. There are two basic and one-line solutions that come from the Python standard library, so we’ll tackle those first.

Just sort it

Python’s sorting algorithm, Timsort [1], is pretty amazing. It’s stable (part of its mergesort heritage) and is designed to deal particularly well with partially sorted data. Which is exactly what we’ve been given: two runs that need to be merged. In fact, what we’ve been tasked with is the merge step of mergesort. So solution #1 is as simple as:

def combine_sorted(left, right):
    return sorted(left + right)

This is easy to understand and obviously correct (we can’t really mess up sorting), as well as very succinct. However, it has a significant memory requirement as it copies both lists into a new one. Also, isn’t sorting going to get really slow with large lists? (spoiler: no it won’t, it’ll scale almost linearly.)

Merge them like heaps

Heaps are great! Also, sorted lists are heaps and Python has a wonderful module for dealing with heaps. In particular here we’re interested in the heapq.merge function. This takes any number of iterables and returns an iterator over the sorted output:

from heapq import merge

def combine_sorted(left, right):
    return merge(left, right)

This solution is again easy to comprehend, and with some knowledge of the documentation and knowing we are using the function within its contract, it’s obviously correct. As a bonus we can combine infinitely long iterators without expending any additional memory beyond a small constant. If …

more ...

Aggregating relationships into JSON objects

In day to day use and operation, we strive for a certain degree of normalization of the data in our database. In reporting though, this normalization causes some friction: we often want the output to contain these duplicates of the data, for sake of row to row completeness. If we’re reporting on top-grossing films in the last decade, it’s easy enough to join the director table and list their name and year of birth. Collecting a set of properties for which we don’t know the names and quantity ahead of time is a little more challenging and interesting.

In our example, we’ll work with a schema that’s a little bit like an Entity–attribute–value model, but with strings for both the attribute and the value to keep things a little simpler. Of course, we’ll pick everyone’s favourite tool for bundling arbitrary data: JSON, and collect these properties into a single object.

Before we get into the actual SQL syntax, let’s describe the schema we’ll be working with:

CREATE TABLE item (
    id SERIAL NOT NULL,
    description TEXT,
    PRIMARY KEY (id))

CREATE TABLE item_property (
    id SERIAL NOT NULL,
    item_id INTEGER,
    label TEXT,
    value TEXT,
    PRIMARY KEY (id),
    CONSTRAINT uq_property_label UNIQUE (item_id, label),
    FOREIGN KEY(item_id) REFERENCES item (id))
more ...