Setting eager defaults for SQLAlchemy ORM models

Default values. We tend to not really think about them. They casually get applied at just the right time when we persist our objects, and everything is right with the universe. Except maybe, sometimes, we need that default value to be there before we flush to the database. What if we want the default earlier?

All defaults, all the time

Let’s start with something basic, where we try to eagerly apply as many defaults as we can during construction. SQLAlchemy allows for a whole host of different defaults, but briefly summarized, these are broadly what are accepted:

  1. Constant values. Strings, booleans, containers, or any value object
  2. SQL expressions. That are executed during flush (e.g. sqlalchemy.func.now())
  3. Python callables. These can be of two kinds: simple argument-less functions, or ones that are context sensitive, meaning they accept an execution context, which allows access to other columns’ values and various other bits.

During object creation, we don’t actually interact with the database, so SQL expressions are meaningless, and because Python functions will expect a context, it’s easier to just ignore all of them. Constant values it is!

So how do we go about this? Overriding the __init__ method is the obvious first candidate. Unfortunately, that doesn’t work due to the internals of the ORM machinery. Thankfully the SQLAlchemy developers have thought of us and there’s the option to provide an alternative constructor during the creation of the ORM Base class. Using this, let’s define a Base, our User model and a basic alternative constructor:

more ...

Merging sorted lists

Recently I was asked to sketch out a function that combines two sorted lists into a single (sorted) list. It’s a fun question with a couple of different answers, potentially all of them correct but with different pros and cons. There are two basic and one-line solutions that come from the Python standard library, so we’ll tackle those first.

Just sort it

Python’s sorting algorithm, Timsort [1], is pretty amazing. It’s stable (part of its mergesort heritage) and is designed to deal particularly well with partially sorted data. Which is exactly what we’ve been given: two runs that need to be merged. In fact, what we’ve been tasked with is the merge step of mergesort. So solution #1 is as simple as:

def combine_sorted(left, right):
    return sorted(left + right)

This is easy to understand and obviously correct (we can’t really mess up sorting), as well as very succinct. However, it has a significant memory requirement as it copies both lists into a new one. Also, isn’t sorting going to get really slow with large lists? (spoiler: no it won’t, it’ll scale almost linearly.)

Merge them like heaps

Heaps are great! Also, sorted lists are heaps and Python has a wonderful module for dealing with heaps. In particular here we’re interested in the heapq.merge function. This takes any number of iterables and returns an iterator over the sorted output:

from heapq import merge

def combine_sorted(left, right):
    return merge(left, right)

This solution is again easy to comprehend, and with some knowledge of the documentation and knowing we are using the function within its contract, it’s obviously correct. As a bonus we can combine infinitely long iterators without expending any additional memory beyond a small constant. If …

more ...

Aggregating relationships into JSON objects

In day to day use and operation, we strive for a certain degree of normalization of the data in our database. In reporting though, this normalization causes some friction: we often want the output to contain these duplicates of the data, for sake of row to row completeness. If we’re reporting on top-grossing films in the last decade, it’s easy enough to join the director table and list their name and year of birth. Collecting a set of properties for which we don’t know the names and quantity ahead of time is a little more challenging and interesting.

In our example, we’ll work with a schema that’s a little bit like an Entity–attribute–value model, but with strings for both the attribute and the value to keep things a little simpler. Of course, we’ll pick everyone’s favourite tool for bundling arbitrary data: JSON, and collect these properties into a single object.

Before we get into the actual SQL syntax, let’s describe the schema we’ll be working with:

CREATE TABLE item (
    id SERIAL NOT NULL,
    description TEXT,
    PRIMARY KEY (id))

CREATE TABLE item_property (
    id SERIAL NOT NULL,
    item_id INTEGER,
    label TEXT,
    value TEXT,
    PRIMARY KEY (id),
    CONSTRAINT uq_property_label UNIQUE (item_id, label),
    FOREIGN KEY(item_id) REFERENCES item (id))
more ...

Subset checking in PostgreSQL with SQLAlchemy

Let’s assume we have a system where events are stored for multiple services and tenants. Let’s also assume that our fictional system has a means of updating many of these events at a time, for instance to mark them as unimportant. And for the sake of keeping things relevant, let’s assume that this service is available via some authenticated public API.

Given all of the above, and the knowledge that we can’t just trust anyone to limit themselves to events that are theirs to edit, we’ll have to verify that all of the events selected for editing are within the scope of editing for the user.

The simplest way to do this would be to load every item from the database and check whether it’s eligible for modification. However, this is something that scales terribly past a few dozen records, so let’s not even consider that.

Set theory to the rescue

If the phrase “verify a set of given IDs are all part of a set of valid IDs” makes you think of sets and subset checking, you already know what this is about. Python has a set type that provides a bunch of useful operations that allow us to check whether a given set A ({1, 4}) has all of its values present in set B ({2, 4, 6}; it does not). We can use this to solve our problem:

user_selected_ids = {1, 4}
permissible_ids_q = session.query(Event.id).filter_by(
    service_id=relevant_service,
    tenant_id=current_tenant)
permissible_event_ids = {row.id for row in permissible_ids_q}
assert user_selected_ids <= permissible_event_ids
more ...

Storing and verifying passwords with SQLAlchemy

I really enjoy abstractions.

Abstractions are the lifeblood of programming. They take complex operations and make them easy to work with through accessible interfaces. This article will be about doing that with the way we store and verify passwords (or rather their cryptographic hashes) in (web-)applications based on SQLAlchemy. And in such a way that we can upgrade the security of passwords as old encryption schemes are broken or proven insufficient and new ones get introduced.

The ways of verifying passwords

There are many different ways to deal with password verification, but they can be classified in three rough categories:

Ad-hoc verification

This approach is often used for one-off scripts or trivial applications that have a single verification and have very little to gain from reusable components. The following shows a very basic User model and password verification code:

import bcrypt
from sqlalchemy import Column, Integer, Text

class User(Base):
    __tablename__ = 'user'
    id = Column(Integer, primary_key=True)
    name = Column(Text)
    password = Column(Text)

pwhash = bcrypt.hashpw(login_data['password'], user.password)
if user.password == pwhash:
    print 'Access granted'

The snippet above uses the Python bcrypt package for key derivation, but you may well use another key derivation method, like PBKDF2 or scrypt. To be sure: It should not have a static salt, and it should definitely not be a single round of optimized-for-speed SHA1 or SHA2 functions. [1] In short, use proven and existing methods.

Assuming that the hashing algorithm is secure, this code is perfectly fine for the stated goal. However, most applications are not trivial and most one-off scripts tend to find their way into repeated use. And over time you find yourself confronted with more and more repeats of the same code to verify a user’s password. Something needs to be done, it’s time to …

more ...

Creating a mostly-random color generator

Grid of randomly colored hexagons

Fully randomized colors

A coloring like this may be desired for some situations, but for most it’s too random.

In the previous two posts we’ve explored how to draw and tile hexagons, creating images that can be seamlessly repeated. We also briefly covered coloring them using a random color generator. The coloring process itself worked fine, but many times the created tiling turned out not very visually appealing, the colors of too harsh a contrast in tone and brightness.

In this post we’ll cover the creation of a mostly-random color generator. One that creates random colors within a certain fraction of the available colorspace.

Picking a color representation

There are a few different ways to describe colors in RGB colorspace, providing us with different approaches on how to restrict the available portion to select colors randomly from:

  • RGB: Red, Green and Blue values — A direct representation of the intensity of each of the color components.
  • HSV: Hue, Saturation, Value (brightness) — A cylindrical color mapping, common in color wheels in various graphics programs
  • HSL: Hue, Saturation and Lightness — Another cylindrical color mapping, similar to HSV but with a few different behaviors which we’ll discuss in a moment

Any of the above could be used, and all three have potentially interesting behavior when certain values are kept constant, or only allowed to vary by a small amount. If we would like a behavior where we can restrict the hue of the color but have full variation in lightness and color intensity, the direct RGB mode is ruled out.

The choice between HSV and HSL mostly comes down to a matter of taste. Intuitively, when I read a color where all three channels are maxed, I would expect that to be a fully saturated and maximally intense color. HSV …

more ...