Setting eager defaults for SQLAlchemy ORM models

Default values. We tend not to think about them much. They casually get applied at just the right time when we persist our objects, and everything is right with the universe. Except sometimes we need that default value to be there before we flush to the database. What if we want the default earlier?

All defaults, all the time

Let’s start with something basic, where we try to eagerly apply as many defaults as we can during construction. SQLAlchemy allows for a whole host of different defaults, but briefly summarized, these are the broad categories that are accepted:

  1. Constant values. Strings, booleans, containers, or any other value object
  2. SQL expressions. These are executed during flush (e.g. sqlalchemy.func.now())
  3. Python callables. These can be of two kinds: simple argument-less functions, or ones that are context sensitive, meaning they accept an execution context, which allows access to other columns’ values and various other bits.

During object creation, we don’t actually interact with the database, so SQL expressions are meaningless there, and because Python callables may expect an execution context, it’s easiest to just ignore all of them. Constant values it is!

So how do we go about this? Overriding the __init__ method is the obvious first candidate. Unfortunately, that doesn’t work due to the internals of the ORM machinery. Thankfully, the SQLAlchemy developers have thought of us, and there’s the option to provide an alternative constructor when creating the ORM Base class. Using this, let’s define a Base, our User model, and a basic alternative constructor:
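
In broad strokes, that could look something like this. Consider it a minimal sketch rather than the article’s exact code: it assumes a plain declarative setup and applies only constant (scalar) defaults:

from sqlalchemy import Column, Integer, Text
from sqlalchemy.orm import declarative_base

def eager_defaults_constructor(self, **kwargs):
    # Keyword-based attribute assignment, as the default constructor does.
    for attr, value in kwargs.items():
        setattr(self, attr, value)
    # Eagerly apply constant defaults for columns not explicitly provided.
    for column in self.__table__.columns:
        default = column.default
        if column.key not in kwargs and default is not None and default.is_scalar:
            setattr(self, column.key, default.arg)

Base = declarative_base(constructor=eager_defaults_constructor)

class User(Base):
    __tablename__ = 'user'
    id = Column(Integer, primary_key=True)
    name = Column(Text, default='anonymous')

# User().name == 'anonymous' straight away, before any flush.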

more ...

Aggregating relationships into JSON objects

In day-to-day use and operation, we strive for a certain degree of normalization of the data in our database. In reporting, though, this normalization causes some friction: we often want the output to repeat data across rows, for the sake of row-to-row completeness. If we’re reporting on top-grossing films in the last decade, it’s easy enough to join the director table and list their name and year of birth. Collecting a set of properties for which we don’t know the names and quantity ahead of time is a little more challenging and interesting.

In our example, we’ll work with a schema that’s a little bit like an Entity–attribute–value model, but with strings for both the attribute and the value to keep things a little simpler. Of course, we’ll pick everyone’s favourite tool for bundling arbitrary data: JSON, and collect these properties into a single object.

Before we get into the actual SQL syntax, let’s describe the schema we’ll be working with:

CREATE TABLE item (
    id SERIAL NOT NULL,
    description TEXT,
    PRIMARY KEY (id));

CREATE TABLE item_property (
    id SERIAL NOT NULL,
    item_id INTEGER,
    label TEXT,
    value TEXT,
    PRIMARY KEY (id),
    CONSTRAINT uq_property_label UNIQUE (item_id, label),
    FOREIGN KEY (item_id) REFERENCES item (id));
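
To give away a little of where this is heading: with Item and ItemProperty models mapped to these tables, the aggregation could be expressed in SQLAlchemy roughly as follows, using PostgreSQL’s jsonb_object_agg (a sketch; the full article works through the details):

from sqlalchemy import func

# Fold each item's properties into one JSON object keyed by label,
# e.g. {"colour": "red", "size": "L"}.
properties = func.jsonb_object_agg(
    ItemProperty.label, ItemProperty.value).label('properties')

query = (
    session.query(Item.id, Item.description, properties)
    .join(ItemProperty, ItemProperty.item_id == Item.id)
    .group_by(Item.id, Item.description))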
more ...

Subset checking in PostgreSQL with SQLAlchemy

Let’s assume we have a system where events are stored for multiple services and tenants. Let’s also assume that our fictional system has a means of updating many of these events at a time, for instance to mark them as unimportant. And for the sake of keeping things relevant, let’s assume that this service is available via some authenticated public API.

Given all of the above, and the knowledge that we can’t just trust anyone to limit themselves to events that are theirs to edit, we’ll have to verify that all of the events selected for editing are within the scope of editing for the user.

The simplest way to do this would be to load every item from the database and check whether it’s eligible for modification. However, that approach scales terribly past a few dozen records, so let’s not even consider it.

Set theory to the rescue

If the phrase “verify a set of given IDs are all part of a set of valid IDs” makes you think of sets and subset checking, you already know what this is about. Python has a set type that provides a bunch of useful operations that allow us to check whether a given set A ({1, 4}) has all of its values present in set B ({2, 4, 6}; it does not). We can use this to solve our problem:

user_selected_ids = {1, 4}
# All event IDs the current user may edit, scoped by service and tenant.
permissible_ids_q = session.query(Event.id).filter_by(
    service_id=relevant_service,
    tenant_id=current_tenant)
permissible_event_ids = {row.id for row in permissible_ids_q}
# The selection is only valid if it is a subset of the permissible IDs.
assert user_selected_ids <= permissible_event_ids
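
Note that this still loads every permissible ID into Python. As a stepping stone towards performing the subset check inside PostgreSQL itself, we could instead count the matching rows and compare that against the size of the selection (a sketch, and not necessarily where the rest of the article ends up):

from sqlalchemy import func

# Count only the selected events that fall within the editing scope.
matched = session.query(func.count(Event.id)).filter(
    Event.id.in_(user_selected_ids),
    Event.service_id == relevant_service,
    Event.tenant_id == current_tenant).scalar()

# If any selected ID is out of scope, the counts differ and we reject.
if matched != len(user_selected_ids):
    raise ValueError('selection contains events outside the editing scope')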
more ...

Storing and verifying passwords with SQLAlchemy

I really enjoy abstractions.

Abstractions are the lifeblood of programming. They take complex operations and make them easy to work with through accessible interfaces. This article is about doing exactly that for the way we store and verify passwords (or rather their cryptographic hashes) in (web-)applications based on SQLAlchemy, in such a way that we can upgrade the security of passwords as old hashing schemes are broken or proven insufficient and new ones get introduced.

The ways of verifying passwords

There are many different ways to deal with password verification, but they can be classified into three rough categories:

Ad-hoc verification

This approach is often used for one-off scripts or trivial applications that verify a password in a single place and have very little to gain from reusable components. The following shows a very basic User model and password verification code:

import bcrypt
from sqlalchemy import Column, Integer, Text
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class User(Base):
    __tablename__ = 'user'
    id = Column(Integer, primary_key=True)
    name = Column(Text)
    password = Column(Text)

# Hashing the offered password with the stored hash as salt yields
# the stored hash again if, and only if, the passwords match.
pwhash = bcrypt.hashpw(
    login_data['password'].encode('utf8'), user.password.encode('utf8'))
if user.password.encode('utf8') == pwhash:
    print('Access granted')

The snippet above uses the Python bcrypt package for key derivation, but you may well use another key derivation method, like PBKDF2 or scrypt. Whichever you pick, it should not use a static salt, and it should definitely not be a single round of the optimized-for-speed SHA1 or SHA2 functions. [1] In short, use proven and existing methods.
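
As an aside: the ‘upgrade the security of passwords’ goal from the introduction is exactly what a library like passlib is built for, managing several schemes side by side. A brief sketch of the idea (passlib here is an assumption, not necessarily what the rest of this article builds on):

from passlib.context import CryptContext

# Verify hashes from both schemes, create new hashes with bcrypt,
# and flag pbkdf2_sha256 hashes as due for re-hashing.
pwd_context = CryptContext(
    schemes=['bcrypt', 'pbkdf2_sha256'],
    deprecated=['pbkdf2_sha256'])

stored = pwd_context.hash('correct horse battery staple')
assert pwd_context.verify('correct horse battery staple', stored)
assert not pwd_context.needs_update(stored)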

Assuming that the hashing algorithm is secure, this code is perfectly fine for the stated goal. However, most applications are not trivial, and most one-off scripts tend to find their way into repeated use. Over time you find yourself confronted with more and more repeats of the same code to verify a user’s password. Something needs to be done; it’s time to …

more ...

Mordac the (query) preventer

A small update, a much delayed update, but still an update.

A few days ago, cwillu asked the following question in #sqlalchemy [1]:

<cwillu_> is there a way to entirely disable autocommit?
<cwillu_> i.e., any insert outside of an explicit transaction
          will either error out, or just immediately rollback?

What eventually worked for this case was simply disabling autocommit at the connection level. Explicit rollbacks were issued at the end of each test and all worked fine. But given that SQLAlchemy features a pretty nifty event system, I was fairly sure a better solution was available. Also, I had been waiting for an excuse to experiment with that part of SQLAlchemy, and this offered exactly that.

As it turns out, preventing queries from happening is about as straightforward as it gets. A quick look at the list of SQLAlchemy Core events identifies two likely candidates: 'before_cursor_execute' and 'before_execute'. Both events are triggered when queries are executed, but the latter is documented to work on a higher level: it’s signaled before the query is compiled for the appropriate dialect. Given that we want to entirely prevent queries outside of transactions, stopping them before doing any unnecessary work seems best, so we’ll use that one.

When the event triggers, our function is called and the caller provides us with the connection and a few other arguments, which we’ll collect in *args. The connection has a method in_transaction(), which indicates whether a transaction has explicitly been started for it. That is, even with autocommit turned off, it will still return False after the first execute() call, because no transaction was explicitly begun. This is exactly what we need to know, and the last thing to do is raise an appropriate error. And so Mordac the Preventer [2] is born:
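
A minimal sketch of such a listener, assuming an existing engine to attach it to:

from sqlalchemy import event
from sqlalchemy.exc import InvalidRequestError

@event.listens_for(engine, 'before_execute')
def mordac_the_preventer(conn, *args):
    # Refuse to execute any statement outside an explicit transaction.
    if not conn.in_transaction():
        raise InvalidRequestError(
            'No queries allowed outside of explicit transactions.')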

more ...

Mutation tracking in nested JSON structures using SQLAlchemy

This is part two of a two-part post on storage of JSON using SQLAlchemy. The first post covered the basics of creating a JSON column type and tracking mutations. In this post, we will continue from there to cover mutation tracking in arbitrarily nested structures.

In the previous post we ended with an example of appending to an existing list. Upon committing the changes in the session and reloading the object, it was shown the appended string had not been stored. This happened because changing the list in-place did not trigger the changed() method of the class MutableDict. Only setting or deleting a key from the dictionary marks it as changed, and marking it as changed upon access (which is all we did on the dictionary itself) would cause far too many updates of the database.

What we wanted (and perhaps expected) is behavior where changing the list marks the dictionary it’s part of as changed. And for completeness, if the dictionary contained a number of nested dictionaries, changing any of them at any level should mark the class MutableDict as changed. To achieve this, we need a solution that consists of the following parts (sketched briefly after the list):

  1. Replacement types for list and dict where all methods that change the object in-place flag it as having changed.
  2. A means to propagate the notification of change up to the top so that it reaches the class MutableDict.
  3. Conversion of all mutable types to the defined replacement types, both when they are added to the existing structure and when the structure is loaded from the database.
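
To illustrate the first two parts, here is a simplified sketch of such a replacement list type; the real implementation wraps all mutating methods and has a dict counterpart:

class TrackedList(list):
    """A list that reports in-place changes to its parent container."""

    def __init__(self, iterable=(), parent=None):
        super().__init__(iterable)
        self.parent = parent

    def changed(self):
        # Propagate the change notification up towards the MutableDict.
        if self.parent is not None:
            self.parent.changed()

    def append(self, value):
        # Every mutating method (extend, __setitem__, pop, ...) gets
        # the same treatment: mutate in place, then notify.
        super().append(value)
        self.changed()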
more ...

Creating a JSON column type for SQLAlchemy

This is part one of a two-part post on storage of JSON using SQLAlchemy. This post will touch on the basics of creating a JSON column type and tracking mutability, and is mostly a rehash of the SQLAlchemy documentation. The second post will cover the tracking of mutability in arbitrarily nested JSON structures, and goes beyond what is covered in the documentation.

The past weeks have been pretty busy. We moved from Leeuwarden to Hoofddorp, which in and of itself went pretty smoothly, but the new apartment is still very much unfinished. In between work and getting the apartment in order, there hasn’t been a lot of time to spend on side projects, but this seemed interesting enough.

A while ago I needed to store a relatively small amount of variable-format data. The requirement was only to store the data, not to search in parts of it or anything else. Additionally, it’s of such a small volume that it doesn’t warrant setting up a document database like MongoDB next to the existing SQLA+Postgres setup. The data itself consisted of a flat mapping of strings to basic scalar types. Something roughly like this:

{
    'title': 'Example data object',
    'description': 'Information about this example object',
    'storage_location': 'Top shelf in the back',
}
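
The heart of the approach is the JSONEncodedDict type from the SQLAlchemy documentation: a TypeDecorator that serializes on the way into the database and deserializes on the way out. A minimal sketch:

import json

from sqlalchemy.types import Text, TypeDecorator

class JSONEncodedDict(TypeDecorator):
    """Stores a Python dictionary as a JSON-encoded string."""

    impl = Text

    def process_bind_param(self, value, dialect):
        # Serialize just before the value is sent to the database.
        if value is not None:
            value = json.dumps(value)
        return value

    def process_result_value(self, value, dialect):
        # Deserialize when the value is loaded from the database.
        if value is not None:
            value = json.loads(value)
        return value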
more ...