Code Vigorous


navigation
home
github
mozillians
email
about
Dustin J. Mitchell

Taskcluster Redeployability

19 Jan 2018

Taskcluster To Date

Taskcluster has always been open source: all of our code is on Github, and we get lots of contributions to the various repositories. Some of our libraries and other packages have seen some use outside of a Taskcluster context, too.

But today, Taskcluster is not a project that could practically be used outside of its single incarnation at Mozilla. For example, we hard-code the name taskcluster.net in a number of places, and we include our config in the source-code repositories. There’s no legal or contractual reason someone else could not run their own Taskcluster, but it would be difficult and almost certainly break next time we made a change.

The Mozilla incarnation is open to use by any Mozilla project, although our focus is obviously Firefox and Firefox-related products like Fennec. This was a practical decision: our priority is to migrate Firefox to Taskcluster, and that is an enormous project. Maintaining an abstract ability to deploy additional instances while working on this project was just too much work for a small team.

The good news is, the focus is now shifting. The migration from Buildbot to Taskcluster is nearly complete, and the remaining pieces are related to hardware deployment, largely by other teams. We are returning to work on something we’ve wanted to do for a long time: support redeployability.

Redeployability

Redeployability means that Taskcluster can be deployed easily, multiple times, similar to OpenStack or Hadoop. If, when we finish this effort, there exist several distinct “instances” of Taskcluster in the world, then we have been successful. We will start by building a “staging” deployment of the Firefox instance, then move on to deploy instances that see production usage, first for other projects in Mozilla, and later for projects outside of Mozilla.

In deciding to pursue this approach, we considered three options:

  • Taskcluster as a service (TCaaS) – we run the single global Taskcluster instance, providing that service to other projects just like Github or Travis-CI.
  • Firefox CI – Taskcluster persists only as a one-off tool to support Firefox’s continuous integration system
  • Redeployability (redeployability) – we provide means for other projects to run dedicated Taskcluster instances

TCaaS allows us to provide what we believe is a powerful platform for complex CI systems to a broad audience. While not quite as easy to get started with, Taskcluster’s flexibility extends far beyond what even a paid plan with CircleCI or Travis-CI can provide. However, this approach would represent a new and different business realm for Mozilla. While the organization has lots of public-facing services like MDN and Addons, other organizations do not depend on these services for production usage, nor do they pay us a fee for use of those services. Defining and meeting SLAs, billing, support staffing, abuse response – none of these are areas of expertise within Mozilla, much less the Taskcluster team. TCaaS would also require substantial changes to the platform itself to isolate paying customers from one another, hide confidential data, accurately meter usage, and so on.

Firefox CI is, in a sense, a scaled-back, internal version of TCaaS: we provide a service, but to only one customer (Firefox Engineering). It would mean transitioning the team to an operations focus, with little or no further development on the platform. It would also open the doors to Firefox-specific design within Taskcluster, such as checking out the Gecko source code in the workers or sorting queued tasks by Gecko branch. This would also shut the door to other projects such as Rust relying on Taskcluster.

Redeployability represents something of a compromise between the other two options. It allows us to make Taskcluster available outside of the narrow confines of Firefox CI without diving into a strange new business model. We’re Mozilla – shipping open source software is right in our wheelhouse.

It comes with some clear advantages, too:

  • Like any open-source project, users will contribute back, focusing on the parts of the system most related to their needs. Most Taskcluster users will be medium- to large-scale engineering organizations, and thus able to dedicate the resources to design and develop significant new features.

  • A well-designed deployment system will help us improve operations for Firefox CI (many of our outages today are caused by deployment errors) and enable deployment by teams focused on operations.

  • We can deploy an entire staging instance of Firefox’s Taskcluster, allowing thorough testing before deploying to production. The current approach to staging changes is ad-hoc and differs between services, workers, and libraries.

Challenges

Of course, the redeployability project is not going to be easy. The next few sections highlight some of the design challenges we are facing. We have begin solving all of these and more, but as none of the solutions are set in stone I will focus just on the challenges themselves.

Deployment Process

Deploying a set of microservices and backend services like databases is pretty easy: tools like Kubernetes are designed for the purpose. Taskcluster, however, is a little more complicated. The system uses a number of cloud providers (packet.net, AWS, and Azure), each of which needs to be configured properly before use.

Worker deployment is a complicated topic: workers must be built into images that can run in cloud services (such as AMIs), and those images must be capable of starting and contacting the Queue to fetch work without further manual input. We already support a wide array of worker deployments on the single instance of Taskcluster, and multiple deployments would probably see an even greater diversity, so any deployment system will need to be extremely flexible.

We want to use the deployment process for all deployments, so it must be fast and reliable. For example, to deploy a fix to the Secrets service, I would modify the configuration to point to the new version and initiate a full re-deploy of the Taskcluster instance. If the deployment process causes downtime by restarting every service, or takes hours to complete, we will find ourselves “cheating” and deploying things directly.

Client Libraries

The Taskcluster client libraries contain code that is generated from the API specification for the Taskcluster services. That means that the latest taskcluster package on PyPi corresponds to the APIs of the services as they are currently deployed. If an instance of Taskcluster is running an older version of those services, then the newer client may not be able to call them correctly. Likewise, an instance created for development purposes might have API methods that aren’t defined in any released version of the client libraries.

A related issue is service discovery: how does a client library find the right URL for a particular service? For platform services like the Queue and Auth, this is fairly simple, but grows more complex for services which might be deployed several times, such as the AWS provisioner.

Configuration and Customization

No two deployments of Taskcluster will be exactly alike – that would defeat the purpose. We must support a limited amount of flexibility: which services are enabled, what features of those services are enabled, and credentials for the various cloud services we use.

In some cases the configuration for a service relies on values derived from another service that must already be started. For example, the Queue needs Taskcluster credentials generated by calling createTask on a running Auth service.

Upgrades

Many of the new features we have added in Taskcluster have been deployed through a carefully-choreographed, manual process. For example, to deploy parameterized roles support, which involved a change to the Auth sevice’s backend support, I disabled writes to the backend, carefully copied the data to the new backend, then landed a patch to begin using the new backend with the old frontend, and so on. We cannot expect users to follow hand-written instructions for such delicate dances.

Conclusion

The Taskcluster team has a lot of work to do. But this is a direction many of us have been itching to move for several years now, so we are eagerly jumping into it. Look for more updates on the redeployability project in the coming months!