Creating a forest from a single seed

In the previous post we explored recreating a binary tree from a pair of sequentialisations. We needed a pair of them because no single sequence by itself uniquely describes the tree: without some additional bit of information, a lone sequence leaves a certain level of ambiguity.

But exactly how ambiguous is a single traversal result? How many different trees can we make that fit a given sequence in isolation? What sort of structure is there in them? Fun questions we can answer with code!
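
Incidentally, the first question has a well-known closed-form answer we can sketch right away (a standard combinatorial fact, not a spoiler of the approach below): a single traversal of n distinct values fits exactly one tree per possible shape, and the number of binary tree shapes on n nodes is the n-th Catalan number.

from math import comb

def tree_count(n):
    # n-th Catalan number: the number of binary tree shapes on n nodes
    return comb(2 * n, n) // (n + 1)

# A sequence of 4 values, read as (say) an inorder traversal,
# fits tree_count(4) == 14 distinct binary trees.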

Seeing the trees

At this point, before we attempt to create our own forest of binary trees, it’s a good time to look into visualising the trees we plan on making. At its core, a binary tree is a specific type of graph, and there are a ton of tools out there to visualise graphs. One of the more popular open source solutions is the excellent GraphViz. There are various Python packages that provide an interface for it, each with its own pros and cons; a review of them is well outside the scope of this post. In short, we’ll be using PyDot, which creates graphs in GraphViz’s Dot format, and these can then be rendered to various image formats.
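
As a taste of what that looks like, here’s a minimal, illustrative snippet (the node names and output filename are just placeholders, and rendering requires the GraphViz binaries to be installed):

import pydot

# Build a tiny three-node tree; nodes are created implicitly by the edges
graph = pydot.Dot(graph_type='digraph')
graph.add_edge(pydot.Edge('root', 'left'))
graph.add_edge(pydot.Edge('root', 'right'))
graph.write_png('tree.png')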

more ...

Reconstructing binary trees from traversals

Between work, play and other side projects, I’ve been playing around with binary trees. They’re a hugely useful data structure in many fields, but more relevantly in this case, they lend themselves to all sorts of noodling and tinkering. Among the many possible questions is the seemingly simple “how do you serialise a tree?” for inter-process communication, a web-based API or just for the hell of it. And given such a serialisation, how do you reconstruct the original tree from it?

One way is to express each node as a list of three elements: the value of the node, its left child and its right child. Each of these children is its own three-element list (or some form of empty value) and on and on it goes. The end of this serialisation will have a particular signature: ]]]]]]]. Something we have seen before.
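
A minimal sketch of that scheme, assuming a bare-bones Node class with value, left and right attributes (the post doesn’t prescribe one):

def serialise(node):
    # An absent child serialises to None, our "empty value"
    if node is None:
        return None
    return [node.value, serialise(node.left), serialise(node.right)]

A tree that leans right by just a handful of nodes already ends in exactly that pile of closing brackets.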

I have nothing against brackets per se, but as some have said: “flat is better than nested.” [1] And there are wonderfully flat ways of describing binary trees. This article will cover one such way: representing a binary tree as a pair of lists, and then reconstructing it from them, without the need for additional structural information.

more ...

Setting eager defaults for SQLAlchemy ORM models

Default values. We tend to not really think about them. They casually get applied at just the right time when we persist our objects, and everything is right with the universe. Except maybe, sometimes, we need that default value to be there before we flush to the database. What if we want the default earlier?

All defaults, all the time

Let’s start with something basic, where we try to eagerly apply as many defaults as we can during construction. SQLAlchemy allows for a whole host of different defaults, but briefly summarised, these are the kinds it accepts:

  1. Constant values. Strings, booleans, containers, or any other value object.
  2. SQL expressions, which are executed during flush (e.g. sqlalchemy.func.now()).
  3. Python callables. These can be of two kinds: simple argument-less functions, or context-sensitive ones that accept an execution context, which allows access to other columns’ values and various other bits. (All three kinds are shown in the sketch right after this list.)
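
To make those concrete, here’s a purely illustrative model using one of each kind (the Widget model and its columns are mine, not from the post):

from uuid import uuid4

from sqlalchemy import Column, DateTime, Integer, Text, func
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()  # a plain Base, just for this illustration

class Widget(Base):
    __tablename__ = 'widget'

    id = Column(Integer, primary_key=True)
    status = Column(Text, default='new')               # 1. constant value
    created = Column(DateTime, default=func.now())     # 2. SQL expression
    token = Column(Text, default=lambda: uuid4().hex)  # 3. Python callable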

During object creation, we don’t actually interact with the database, so SQL expressions are meaningless there, and because Python callables may expect an execution context, it’s easier to just ignore those as well. Constant values it is!

So how do we go about this? Overriding the __init__ method is the obvious first candidate. Unfortunately, that doesn’t work due to the internals of the ORM machinery. Thankfully the SQLAlchemy developers have thought of us and there’s the option to provide an alternative constructor during the creation of the ORM Base class. Using this, let’s define a Base, our User model and a basic alternative constructor:
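
The post continues below; as a rough sketch of the shape this takes (my assumption of the details, not the post’s exact code), declarative_base() accepts a constructor keyword for exactly this purpose:

from sqlalchemy import Column, Integer, Text
from sqlalchemy.ext.declarative import declarative_base

def eager_constructor(self, **kwargs):
    # Mimic the stock constructor: assign the given keyword arguments
    for key, value in kwargs.items():
        setattr(self, key, value)
    # Then eagerly apply constant (scalar) column defaults
    for column in type(self).__table__.columns:
        default = column.default
        if default is not None and default.is_scalar and column.name not in kwargs:
            setattr(self, column.name, default.arg)

Base = declarative_base(constructor=eager_constructor)

class User(Base):
    __tablename__ = 'user'

    id = Column(Integer, primary_key=True)
    name = Column(Text)
    role = Column(Text, default='member')

With this in place, User(name='alice').role is 'member' immediately, no flush required.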

more ...

Merging sorted lists

Recently I was asked to sketch out a function that combines two sorted lists into a single (sorted) list. It’s a fun question with a couple of different answers, all of them potentially correct but with different pros and cons. Two basic, one-line solutions come straight from the Python standard library, so we’ll tackle those first.

Just sort it

Python’s sorting algorithm, Timsort [1], is pretty amazing. It’s stable (part of its mergesort heritage) and is designed to deal particularly well with partially sorted data. Which is exactly what we’ve been given: two runs that need to be merged. In fact, what we’ve been tasked with is the merge step of mergesort. So solution #1 is as simple as:

def combine_sorted(left, right):
    return sorted(left + right)

This is easy to understand and obviously correct (we can’t really mess up sorting), as well as very succinct. However, it has a significant memory requirement, as it copies both lists into a new one. Also, isn’t sorting going to get really slow with large lists? (spoiler: no it won’t; Timsort detects the two pre-sorted runs and merges them, so it scales almost linearly.)

Merge them like heaps

Heaps are great! Also, sorted lists are heaps and Python has a wonderful module for dealing with heaps. In particular here we’re interested in the heapq.merge function. This takes any number of iterables and returns an iterator over the sorted output:

from heapq import merge

def combine_sorted(left, right):
    return merge(left, right)
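
Note that merge returns a lazy iterator rather than a list; a quick demonstration (my own, not from the post):

# The result is an iterator, so materialise it when a list is needed
print(list(combine_sorted([1, 3, 5], [2, 4, 6])))  # [1, 2, 3, 4, 5, 6]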

This solution is again easy to comprehend, and armed with the documentation and the knowledge that we’re using the function within its contract, it’s obviously correct. As a bonus, we can combine infinitely long iterators without expending any additional memory beyond a small constant. If …

more ...

Aggregating relationships into JSON objects

In day-to-day use and operation, we strive for a certain degree of normalization of the data in our database. In reporting, though, this normalization causes some friction: we often want the output to repeat data from row to row, for the sake of per-row completeness. If we’re reporting on top-grossing films in the last decade, it’s easy enough to join the director table and list their name and year of birth. Collecting a set of properties whose names and quantity we don’t know ahead of time is a little more challenging and interesting.

In our example, we’ll work with a schema that’s a little bit like an Entity–attribute–value model, but with strings for both the attribute and the value to keep things a little simpler. Of course, we’ll pick everyone’s favourite tool for bundling arbitrary data: JSON, and collect these properties into a single object.

Before we get into the actual SQL syntax, let’s describe the schema we’ll be working with:

CREATE TABLE item (
    id SERIAL NOT NULL,
    description TEXT,
    PRIMARY KEY (id));

CREATE TABLE item_property (
    id SERIAL NOT NULL,
    item_id INTEGER,
    label TEXT,
    value TEXT,
    PRIMARY KEY (id),
    CONSTRAINT uq_property_label UNIQUE (item_id, label),
    FOREIGN KEY (item_id) REFERENCES item (id));
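
The post is cut off here, but to sketch where it’s likely heading (an assumption on my part): PostgreSQL’s json_object_agg aggregate pairs each label with its value, and the unique constraint above conveniently rules out duplicate keys per item:

SELECT item.id,
       item.description,
       json_object_agg(prop.label, prop.value) AS properties
FROM item
JOIN item_property AS prop ON prop.item_id = item.id
GROUP BY item.id, item.description;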
more ...

Subset checking in PostgreSQL with SQLAlchemy

Let’s assume we have a system where events are stored for multiple services and tenants. Let’s also assume that our fictional system has a means of updating many of these events at a time, for instance to mark them as unimportant. And for the sake of keeping things relevant, let’s assume that this service is available via some authenticated public API.

Given all of the above, and the knowledge that we can’t just trust anyone to limit themselves to events that are theirs to edit, we’ll have to verify that all of the events selected for editing are within the scope of editing for the user.

The simplest way to do this would be to load every item from the database and check whether it’s eligible for modification. However, this is something that scales terribly past a few dozen records, so let’s not even consider that.

Set theory to the rescue

If the phrase “verify a set of given IDs are all part of a set of valid IDs” makes you think of sets and subset checking, you already know what this is about. Python has a set type that provides a bunch of useful operations that allow us to check whether a given set A ({1, 4}) has all of its values present in set B ({2, 4, 6}; it does not). We can use this to solve our problem:

user_selected_ids = {1, 4}
# IDs of all events this user may touch, scoped by service and tenant
permissible_ids_q = session.query(Event.id).filter_by(
    service_id=relevant_service,
    tenant_id=current_tenant)
permissible_event_ids = {row.id for row in permissible_ids_q}
# <= on sets is the subset operator: every selected ID must be permissible
assert user_selected_ids <= permissible_event_ids
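
The post continues from here; one way it could push the check into the database (my sketch, not necessarily the post’s approach) is to count how many of the selected IDs survive the same filter:

from sqlalchemy import func

permitted_count = (
    session.query(func.count(Event.id))
    .filter(
        Event.id.in_(user_selected_ids),
        Event.service_id == relevant_service,
        Event.tenant_id == current_tenant)
    .scalar())
# The selection is a subset exactly when every selected ID was counted
assert permitted_count == len(user_selected_ids)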
more ...