This is part 2 of a deep-dive into the implementation details of Taskcluster’s backend data stores. Check out part 1 for the background, as we’ll jump right in here!
Azure in Postgres
As of the end of April, we had all of our data in a Postgres database, but the data was pretty ugly. For example, here’s a record of a worker as recorded by worker-manager:
partition_key | testing!2Fstatic-workers
row_key | cc!2Fdd~ee!2Fff
value | {
"state": "requested",
"RowKey": "cc!2Fdd~ee!2Fff",
"created": "2015-12-17T03:24:00.000Z",
"expires": "3020-12-17T03:24:00.000Z",
"capacity": 2,
"workerId": "ee/ff",
"providerId": "updated",
"lastChecked": "2017-12-17T03:24:00.000Z",
"workerGroup": "cc/dd",
"PartitionKey": "testing!2Fstatic-workers",
"lastModified": "2016-12-17T03:24:00.000Z",
"workerPoolId": "testing/static-workers",
"__buf0_providerData": "eyJzdGF0aWMiOiJ0ZXN0ZGF0YSJ9Cg==",
"__bufchunks_providerData": 1
}
version | 1
etag | 0f6e355c-0e7c-4fe5-85e3-e145ac4a4c6c
To reap the goodness of a relational database, that would be a “normal”[*] table: distinct columns, nice data types, and a lot less redundancy.
All access to this data is via some Azure-shaped stored functions, which are also not amenable to the kinds of flexible data access we need:
<tableName>_load
- load a single row<tableName>_create
- create a new row<tableName>_remove
- remove a row<tableName>_modify
- modify a row<tableName>_scan
- return some or all rows in the table
[*] In the normal sense of the word – we did not attempt to apply database normalization.
Database Migrations
So the next step, which we dubbed “phase 2”, was to migrate this schema to one more appropriate to the structure of the data.
The typical approach is to use database migrations for this kind of work, and there are lots of tools for the purpose. For example, Alembic and Django both provide robust support for database migrations – but they are both in Python.
The only mature JS tool is knex, and after some analysis we determined that it both lacked features we needed and brought a lot of additional features that would complicate our usage. It is primarily a “query builder”, with basic support for migrations. Because we target Postgres directly, and because of how we use stored functions, a query builder is not useful. And the migration support in knex, while effective, does not support the more sophisticated approaches to avoiding downtime outlined below.
We elected to roll our own tool, allowing us to get exactly the behavior we wanted.
Migration Scripts
Taskcluster defines a sequence of numbered database versions. Each version corresponds to a specific database schema, which includes the structure of the database tables as well as stored functions. The YAML file for each version specifies a script to upgrade from the previous version, and a script to downgrade back to that version. For example, an upgrade script might add a new column to a table, with the corresponding downgrade dropping that column.
version: 29
migrationScript: |-
begin
alter table secrets add column last_used timestamptz;
end
downgradeScript: |-
begin
alter table secrets drop column last_used;
end
So far, this is a pretty normal approach to migrations. However, a major drawback is that it requires careful coordination around the timing of the migration and deployment of the corresponding code. Continuing the example of adding a new column, if the migration is deployed first, then the existing code may execute INSERT queries that omit the new column. If the new code is deployed first, then it will attempt to read a column that does not yet exist.
There are workarounds for these issues. In this example, adding a default value for the new column in the migration, or writing the queries such that they are robust to a missing column. Such queries are typically spread around the codebase, though, and it can be difficult to ensure (by testing, of course) that they all operate correctly.
In practice, most uses of database migrations are continuously-deployed applications – a single website or application server, where the developers of the application control the timing of deployments. That allows a great deal of control, and changes can be spread out over several migrations that occur in rapid succession.
Taskcluster is not continuously deployed – it is released in distinct versions which users can deploy on their own cadence. So we need a way to run migrations when upgrading to a new Taskcluster release, without breaking running services.
Stored Functions
Part 1 mentioned that all access to data is via stored functions. This is the critical point of abstraction that allows smooth migrations, because stored functions can be changed at runtime.
Each database version specifies definitions for stored functions, either introducing new functions or replacing the implementation of existing functions.
So the version: 29
YAML above might continue with
methods:
create_secret:
args: name text, value jsonb
returns: ''
body: |-
begin
insert
into secrets (name, value, last_used)
values (name, value, now());
end
get_secret:
args: name text
returns: record
body: |-
begin
update secrets
set last_used = now()
where secrets.name = get_secret.name;
return query
select name, value from secrets
where secrets.name = get_secret.name;
end
This redefines two existing functions to operate properly against the new table.
The functions are redefined in the same database transaction as the migrationScript
above, meaning that any calls to create_secret
or get_secret
will immediately begin populating the new column.
A critical rule (enforced in code) is that the arguments and return type of a function cannot be changed.
To support new code that references the last_used
value, we add a new function:
get_secret_with_last_used:
args: name text
returns: record
body: |-
begin
update secrets
set last_used = now()
where secrets.name = get_secret.name;
return query
select name, value, last_used from secrets
where secrets.name = get_secret.name;
end
Another critical rule is that DB migrations must be applied fully before the corresponding version of the JS code is deployed.
In this case, that means code that uses get_secret_with_last_used
is deployed only after the function is defined.
All of this can be thoroughly tested in isolation from the rest of the Taskcluster code, both unit tests for the functions and integration tests for the upgrade and downgrade scripts. Unit tests for redefined functions should continue to pass, unchanged, providing an easy-to-verify compatibility check.
Phase 2 Migrations
The migrations from Azure-style tables to normal tables are, as you might guess, a lot more complex than this simple example. Among the issues we faced:
- Azure-entities uses a multi-field base64 encoding for many data-types, that must be decoded (such as
__buf0_providerData
/__bufchunks_providerData
in the example above) - Partition and row keys are encoded using a custom variant of urlencoding that is remarkably difficult to implement in pl/pgsql
- Some columns (such as secret values) are encrypted.
- Postgres generates slightly different ISO8601 timestamps from JS’s
Date.toJSON()
We split the work of performing these migrations across the members of the Taskcluster team, supporting each other through the tricky bits, in a rather long but ultimately successful “Postgres Phase 2” sprint.
0042 - secrets phase 2
Let’s look at one of the simpler examples: the secrets service.
The migration script creates a new secrets
table from the data in the secrets_entities
table, using Postgres’s JSON function to unpack the value
column into “normal” columns.
The database version YAML file carefully redefines the Azure-compatible DB functions to access the new secrets
table.
This involves unpacking function arguments from their JSON formats, re-packing JSON blobs for return values, and even some light parsing of the condition string for the secrets_entities_scan
function.
It then defines new stored functions for direct access to the normal table. These functions are typically similar, and more specific to the needs of the service. For example, the secrets service only modifies secrets in an “upsert” operation that replaces any existing secret of the same name.
Step By Step
To achieve an extra layer of confidence in our work, we landed all of the phase-2 PRs in two steps. The first step included migration and downgrade scripts and the redefined stored functions, as well as tests for those. But critically, this step did not modify the service using the table (the secrets service in this case). So the unit tests for that service use the redefined stored functions, acting as a kind of integration-test for their implementations. This also validates that the service will continue to run in production between the time the database migration is run and the time the new code is deployed. We landed this step on GitHub in such a way that reviewers could see a green check-mark on the step-1 commit.
In the second step, we added the new, purpose-specific stored functions and rewrote the service to use them. In services like secrets, this was a simple change, but some other services saw more substantial rewrites due to more complex requirements.
Deprecation
Naturally, we can’t continue to support old functions indefinitely: eventually they would be prohibitively complex or simply impossible to implement. Another deployment rule provides a critical escape from this trap: Taskcluster must be upgraded at most one major version at a time (e.g., 36.x to 37.x). That provides a limited window of development time during which we must maintain compatibility.
Defining that window is surprisingly tricky, but essentially it’s two major revisions. Like the software engineers we are, we packaged up that tricky computation in a script, and include the lifetimes in some generated documentation
What’s Next?
This post has hinted at some of the complexity of “phase 2”. There are lots of details omitted, of course!
But there’s one major detail that got us in a bit of trouble.
In fact, we were forced to roll back during a planned migration – not an engineer’s happiest moment.
The queue_tasks_entities
and queue_artifacts_entities
table were just too large to migrate in any reasonable amount of time.
Part 3 will describe what happened, how we fixed the issue, and what we’re doing to avoid having the same issue again.