This is part 2 of a deep-dive into the implementation details of Taskcluster’s backend data stores. Check out part 1 for the background, as we’ll jump right in here!

Azure in Postgres

As of the end of April, we had all of our data in a Postgres database, but the data was pretty ugly. For example, here’s a record of a worker as recorded by worker-manager:

partition_key | testing!2Fstatic-workers
row_key       | cc!2Fdd~ee!2Fff
value         | {
    "state": "requested",
    "RowKey": "cc!2Fdd~ee!2Fff",
    "created": "2015-12-17T03:24:00.000Z",
    "expires": "3020-12-17T03:24:00.000Z",
    "capacity": 2,
    "workerId": "ee/ff",
    "providerId": "updated",
    "lastChecked": "2017-12-17T03:24:00.000Z",
    "workerGroup": "cc/dd",
    "PartitionKey": "testing!2Fstatic-workers",
    "lastModified": "2016-12-17T03:24:00.000Z",
    "workerPoolId": "testing/static-workers",
    "__buf0_providerData": "eyJzdGF0aWMiOiJ0ZXN0ZGF0YSJ9Cg==",
    "__bufchunks_providerData": 1
  }
version       | 1
etag          | 0f6e355c-0e7c-4fe5-85e3-e145ac4a4c6c

To reap the goodness of a relational database, that would be a “normal”[*] table: distinct columns, nice data types, and a lot less redundancy.

All access to this data is via some Azure-shaped stored functions, which are also not amenable to the kinds of flexible data access we need:

<tableName>_load - load a single row
<tableName>_create - create a new row
<tableName>_remove - remove a row
<tableName>_modify - modify a row
<tableName>_scan - return some or all rows in the table

[*] In the normal sense of the word – we did not attempt to apply database normalization.

Database Migrations

So the next step, which we dubbed “phase 2”, was to migrate this schema to one more appropriate to the structure of the data.

The typical approach is to use database migrations for this kind of work, and there are lots of tools for the purpose. For example, Alembic and Django both provide robust support for database migrations – but they are both in Python.

The only mature JS tool is knex, and after some analysis we determined that it both lacked features we needed and brought a lot of additional features that would complicate our usage. It is primarily a “query builder”, with basic support for migrations. Because we target Postgres directly, and because of how we use stored functions, a query builder is not useful. And the migration support in knex, while effective, does not support the more sophisticated approaches to avoiding downtime outlined below.

We elected to roll our own tool, allowing us to get exactly the behavior we wanted.

Migration Scripts

Taskcluster defines a sequence of numbered database versions. Each version corresponds to a specific database schema, which includes the structure of the database tables as well as stored functions. The YAML file for each version specifies a script to upgrade from the previous version, and a script to downgrade back to that version. For example, an upgrade script might add a new column to a table, with the corresponding downgrade dropping that column.

version: 29
migrationScript: |-
  begin
    alter table secrets add column last_used timestamptz;
  end
downgradeScript: |-
  begin
    alter table secrets drop column last_used;
  end

So far, this is a pretty normal approach to migrations. However, a major drawback is that it requires careful coordination around the timing of the migration and deployment of the corresponding code. Continuing the example of adding a new column, if the migration is deployed first, then the existing code may execute INSERT queries that omit the new column. If the new code is deployed first, then it will attempt to read a column that does not yet exist.

There are workarounds for these issues. In this example, adding a default value for the new column in the migration, or writing the queries such that they are robust to a missing column. Such queries are typically spread around the codebase, though, and it can be difficult to ensure (by testing, of course) that they all operate correctly.

In practice, most uses of database migrations are continuously-deployed applications – a single website or application server, where the developers of the application control the timing of deployments. That allows a great deal of control, and changes can be spread out over several migrations that occur in rapid succession.

Taskcluster is not continuously deployed – it is released in distinct versions which users can deploy on their own cadence. So we need a way to run migrations when upgrading to a new Taskcluster release, without breaking running services.

Stored Functions

Part 1 mentioned that all access to data is via stored functions. This is the critical point of abstraction that allows smooth migrations, because stored functions can be changed at runtime.

Each database version specifies definitions for stored functions, either introducing new functions or replacing the implementation of existing functions. So the version: 29 YAML above might continue with

methods:
  create_secret:
    args: name text, value jsonb
    returns: ''
    body: |-
      begin
        insert
        into secrets (name, value, last_used)
        values (name, value, now());
      end
  get_secret:
    args: name text
    returns: record
    body: |-
      begin
        update secrets
        set last_used = now()
        where secrets.name = get_secret.name;

        return query
        select name, value from secrets
        where secrets.name = get_secret.name;
      end

This redefines two existing functions to operate properly against the new table. The functions are redefined in the same database transaction as the migrationScript above, meaning that any calls to create_secret or get_secret will immediately begin populating the new column. A critical rule (enforced in code) is that the arguments and return type of a function cannot be changed.

To support new code that references the last_used value, we add a new function:

  get_secret_with_last_used:
    args: name text
    returns: record
    body: |-
      begin
        update secrets
        set last_used = now()
        where secrets.name = get_secret.name;

        return query
        select name, value, last_used from secrets
        where secrets.name = get_secret.name;
      end

Another critical rule is that DB migrations must be applied fully before the corresponding version of the JS code is deployed. In this case, that means code that uses get_secret_with_last_used is deployed only after the function is defined.

All of this can be thoroughly tested in isolation from the rest of the Taskcluster code, both unit tests for the functions and integration tests for the upgrade and downgrade scripts. Unit tests for redefined functions should continue to pass, unchanged, providing an easy-to-verify compatibility check.

Phase 2 Migrations

The migrations from Azure-style tables to normal tables are, as you might guess, a lot more complex than this simple example. Among the issues we faced:

Azure-entities uses a multi-field base64 encoding for many data-types, that must be decoded (such as __buf0_providerData/__bufchunks_providerData in the example above)
Partition and row keys are encoded using a custom variant of urlencoding that is remarkably difficult to implement in pl/pgsql
Some columns (such as secret values) are encrypted.
Postgres generates slightly different ISO8601 timestamps from JS’s Date.toJSON()

We split the work of performing these migrations across the members of the Taskcluster team, supporting each other through the tricky bits, in a rather long but ultimately successful “Postgres Phase 2” sprint.

0042 - secrets phase 2

Let’s look at one of the simpler examples: the secrets service. The migration script creates a new secrets table from the data in the secrets_entities table, using Postgres’s JSON function to unpack the value column into “normal” columns.

The database version YAML file carefully redefines the Azure-compatible DB functions to access the new secrets table. This involves unpacking function arguments from their JSON formats, re-packing JSON blobs for return values, and even some light parsing of the condition string for the secrets_entities_scan function.

It then defines new stored functions for direct access to the normal table. These functions are typically similar, and more specific to the needs of the service. For example, the secrets service only modifies secrets in an “upsert” operation that replaces any existing secret of the same name.

Step By Step

To achieve an extra layer of confidence in our work, we landed all of the phase-2 PRs in two steps. The first step included migration and downgrade scripts and the redefined stored functions, as well as tests for those. But critically, this step did not modify the service using the table (the secrets service in this case). So the unit tests for that service use the redefined stored functions, acting as a kind of integration-test for their implementations. This also validates that the service will continue to run in production between the time the database migration is run and the time the new code is deployed. We landed this step on GitHub in such a way that reviewers could see a green check-mark on the step-1 commit.

In the second step, we added the new, purpose-specific stored functions and rewrote the service to use them. In services like secrets, this was a simple change, but some other services saw more substantial rewrites due to more complex requirements.

Deprecation

Naturally, we can’t continue to support old functions indefinitely: eventually they would be prohibitively complex or simply impossible to implement. Another deployment rule provides a critical escape from this trap: Taskcluster must be upgraded at most one major version at a time (e.g., 36.x to 37.x). That provides a limited window of development time during which we must maintain compatibility.

Defining that window is surprisingly tricky, but essentially it’s two major revisions. Like the software engineers we are, we packaged up that tricky computation in a script, and include the lifetimes in some generated documentation

What’s Next?

This post has hinted at some of the complexity of “phase 2”. There are lots of details omitted, of course!

But there’s one major detail that got us in a bit of trouble. In fact, we were forced to roll back during a planned migration – not an engineer’s happiest moment. The queue_tasks_entities and queue_artifacts_entities table were just too large to migrate in any reasonable amount of time. Part 3 will describe what happened, how we fixed the issue, and what we’re doing to avoid having the same issue again.

Code Vigorous

Taskcluster's DB (Part 2) - DB Migrations