Minimizing downtime when deploying to Heroku

Jonah Williams

Heroku has become our default hosting platform for most new projects. It’s simple to deploy new Rails apps, encourages some good conventions, provides the services most applications need, and offers a simple scaling solution which can usually handle whatever growth new products are able to produce.

We also like automated, and ideally continuous, deployments which allow us to push changes live many times per day as we build out a new product. Since we hope any new product is going to have at least a few users, it would be nice to avoid, or at least minimize, taking the site down during those deploys.

Why take down the site during a deploy?

There are two factors which may cause requests to fail during a deploy.

Heroku’s deploy process

First, the Heroku deploy process itself may introduce a window of time when requests could time out.

According to Heroku’s preboot documentation:

The standard sequence of events when deploying a new app is:

  1. Code push
  2. Run buildpack; store slug
  3. Pause routing service
  4. Kill old dynos
  5. Boot new dynos
  6. Resume routing service

If steps 3 through 6 take too long, requests may time out. Additionally, if the app cannot respond quickly to requests after booting the new dynos, those requests could still time out once routing has resumed.

This may be a problem for some apps, but in my experience the elapsed time is short enough that requests are unlikely to time out. For most new products, letting requests spend several extra seconds in a queue during deploys has not been worth the time to fix. If it is an issue, Heroku offers a beta labs feature called preboot which can eliminate this interval. Preboot introduces some other concerns and is not the focus of this post.

Running DB Migrations

The second, and in my experience most serious, cause of downtime during a deploy is intentional: we enable maintenance mode around the deploy to prevent the app from serving requests while migrations are running. Without this maintenance window, requests could reach the application before migrations had finished, resulting in application errors or potentially the creation of invalid data.

We might design our migrations so that they can all be applied while the site is running. That often involves splitting a migration into multiple steps which must be deployed incrementally. It also requires support from both the deploy mechanics and the development process around authoring migrations. It’s an interesting topic for further discussion, but what could we do to minimize downtime without that sort of process change?

A naive Heroku deploy script

Let’s start with a simple deploy script we could use to push to Heroku when our app passes tests in a CI environment.

  1. Enable maintenance mode.
  2. Push the new commit to Heroku.
  3. Run migrations (which may be a no-op).
  4. Disable maintenance mode.
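
The four steps above can be sketched as a short Python helper that shells out to git and the Heroku CLI. This is a minimal sketch rather than the script we actually run: the app name, the `heroku` git remote, and the `master` branch are assumptions, and it takes for granted that the Heroku CLI is installed and authenticated on the CI machine.

```python
import subprocess


def naive_deploy_commands(app, remote="heroku", branch="master"):
    """Return the commands for the four-step deploy, in order."""
    return [
        # 1. Enable maintenance mode.
        ["heroku", "maintenance:on", "--app", app],
        # 2. Push the commit under test to the (assumed) Heroku git remote.
        ["git", "push", remote, f"HEAD:{branch}"],
        # 3. Run migrations; a no-op when nothing is pending.
        ["heroku", "run", "--app", app, "rake", "db:migrate"],
        # 4. Disable maintenance mode.
        ["heroku", "maintenance:off", "--app", app],
    ]


def run_all(commands, runner=subprocess.check_call):
    """Run each command in order, stopping at the first failure."""
    for command in commands:
        runner(command)


if __name__ == "__main__":
    run_all(naive_deploy_commands("my-app"))  # "my-app" is a placeholder
```

Because `run_all` stops at the first failing command, a failed push or migration leaves the app in maintenance mode until someone investigates, which seems like the safer default.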

Users will see a maintenance page for a couple of minutes while we deploy changes, but for an acceptance environment or experimental product that might be totally acceptable.

One possible issue with this approach: if our app has background processes (like Resque or Sidekiq workers), they might start running before or during database migrations and encounter errors.

Simple improvements

A simple first step is to consider the set of changes we are about to deploy and only enable maintenance mode when it is necessary to run migrations.

We can compare the revision to be deployed with the last deployed commit and determine if any database changes have occurred. For a Rails app, changes in the db directory suggest new migrations or changes in seed data, which we’ll assume need to be applied before the app can serve new requests.
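
One way to make that comparison is to ask git which files differ between the two revisions and look for anything under db. The sketch below assumes the last deployed commit is available locally as a ref (for example `heroku/master` after fetching the Heroku remote); the function names are ours, chosen for illustration.

```python
import subprocess


def changed_paths(deployed_rev, new_rev, cwd="."):
    """List the files that differ between the deployed and new revisions."""
    output = subprocess.check_output(
        ["git", "diff", "--name-only", f"{deployed_rev}..{new_rev}"],
        cwd=cwd,
        text=True,
    )
    return [line for line in output.splitlines() if line]


def needs_maintenance(paths):
    """True when any changed file lives in the top-level db directory."""
    return any(path.startswith("db/") for path in paths)


if __name__ == "__main__":
    # Assumes `git fetch heroku` has already been run.
    print(needs_maintenance(changed_paths("heroku/master", "HEAD")))
```

This errs on the side of caution: a change to db/seeds.rb or db/schema.rb alone will also trigger maintenance mode.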

In addition, we can scale any background workers down to zero before deploying changes and scale them back up once migrations have finished.
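
Putting both improvements together, a deploy plan might look like the sketch below. It assumes the background jobs run under a Procfile process type named `worker` and that we only bother scaling them down when migrations are pending; both are choices made for this example.

```python
def deploy_commands(app, migrate, worker_count=1, remote="heroku", branch="master"):
    """Return the deploy commands, using maintenance mode only when migrating."""
    push = ["git", "push", remote, f"HEAD:{branch}"]
    if not migrate:
        # No db changes: a plain push, with no maintenance page at all.
        return [push]
    return [
        # Stop background workers so they never see a half-migrated schema.
        ["heroku", "ps:scale", "worker=0", "--app", app],
        ["heroku", "maintenance:on", "--app", app],
        push,
        ["heroku", "run", "--app", app, "rake", "db:migrate"],
        # Migrations are done, so workers can resume.
        ["heroku", "ps:scale", f"worker={worker_count}", "--app", app],
        ["heroku", "maintenance:off", "--app", app],
    ]
```

The `migrate` flag would come from a check of the db directory, and each command can be handed to `subprocess.check_call` in order.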

Now we only see a maintenance page when necessary. However, in those cases the app may still be in maintenance mode for several minutes.

Saving time; precompiling assets

To further minimize the time spent in maintenance mode, we can try to move as much work as possible out of that maintenance window. As a Rails app grows, we start to see significant time spent precompiling assets during the deploy. (A simple scaffolded Rails app took ~6 seconds, while one of our larger client apps needed ~2 minutes to precompile assets.) There’s no reason to wait until the app is in maintenance mode to perform this work, or to allow asset precompilation to extend our maintenance window. Precompiling assets locally and pushing them to Heroku can shorten our maintenance window by minutes.
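
A rough sketch of local precompilation, run on the CI machine before maintenance mode is enabled, follows. It relies on Heroku's Ruby buildpack skipping its own precompile step when it finds an asset manifest already present in public/assets. Committing the compiled files on the revision being deployed is an assumption about workflow, as are the helper names and the commit message.

```python
import os
import subprocess


def precompile_commands(message="Precompile assets for deploy"):
    """Commands that compile assets locally and commit the result."""
    return [
        ["bundle", "exec", "rake", "assets:precompile"],
        # --force because public/assets is often listed in .gitignore.
        ["git", "add", "--force", "public/assets"],
        ["git", "commit", "--message", message],
    ]


def precompile_env(base_env=None):
    """Copy the environment and force production asset settings."""
    env = dict(os.environ if base_env is None else base_env)
    env["RAILS_ENV"] = "production"
    return env


def precompile_locally(runner=subprocess.check_call):
    """Run the precompile commands in a production environment."""
    env = precompile_env()
    for command in precompile_commands():
        runner(command, env=env)
```

Note that `git commit` exits with an error when the compiled assets are unchanged, so a real script would want to treat that case as success.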

Getting faster; precompiling slugs

Even with asset precompilation out of the picture, Heroku can spend several minutes compiling a slug when deploying our application.

From “How Heroku Works”:

Terminology: A slug is a bundle of your source, fetched dependencies, the language runtime, and compiled/generated output of the build system – ready for execution.

We need to have this slug before we can start our new dynos or instruct Heroku to run our migrations, but there’s no need to have the app in maintenance mode while we compile the slug. Using the pipelines labs feature, we can compile a slug on one Heroku app before promoting it to our production environment.

Now things are somewhat complicated.

  1. Create a temporary application to use as a slug compiler (or reuse an existing application with the expected name).
  2. Deploy to this temporary application.
  3. Check if we need to run migrations.
  4. Use the pipeline to promote the compiled slug to the production application.
  5. Run migrations if needed.
  6. Re-deploy to the production application using a git push. Pipeline deploys do not update the git repository exposed on each application, so this second deploy brings that git history up to date. We will need that history the next time we deploy and want to check for new db changes. We could avoid this step by preserving the slug compiling application and comparing against its repository instead, but then we would need to remember to use the two apps correctly.
  7. Destroy the slug compiling app as it is no longer needed.
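
The seven steps above might be sketched as the command plan below. Treat it as an illustration under several assumptions: the `pipeline:add` and `pipeline:promote` commands are written as we understand the labs pipeline plugin to expose them and may differ in your CLI version, the remote names `compiler` and `heroku` are placeholders, and maintenance mode is assumed to wrap only the promote and migrate steps. Reusing an existing compiler app (step 1) is left out for brevity.

```python
def pipeline_deploy_commands(prod_app, compiler_app, migrate, branch="master"):
    """Return the commands for a deploy that compiles its slug on a second app."""
    ref = f"HEAD:{branch}"
    commands = [
        # Steps 1-2: create the temporary app and compile the slug there.
        ["heroku", "create", compiler_app, "--remote", "compiler"],
        ["git", "push", "compiler", ref],
        # Assumed labs plugin command: make production the downstream app.
        ["heroku", "pipeline:add", prod_app, "--app", compiler_app],
    ]
    # Step 3 (checking for db changes) is done by the caller, who passes
    # the result in as `migrate`.
    if migrate:
        commands.append(["heroku", "maintenance:on", "--app", prod_app])
    # Step 4: promote the already compiled slug (assumed plugin command).
    commands.append(["heroku", "pipeline:promote", "--app", compiler_app])
    if migrate:
        # Step 5: migrate, then leave maintenance mode.
        commands.append(["heroku", "run", "--app", prod_app, "rake", "db:migrate"])
        commands.append(["heroku", "maintenance:off", "--app", prod_app])
    commands += [
        # Step 6: bring production's git history up to date.
        ["git", "push", "heroku", ref],
        # Step 7: remove the temporary app.
        ["heroku", "apps:destroy", "--app", compiler_app, "--confirm", compiler_app],
    ]
    return commands
```

The slow slug compilation in step 2 happens before maintenance mode is enabled, which is where the savings come from.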

On a current project, this approach reduced the maintenance window from ~5 minutes to ~30 seconds (plus the duration of the migrations in both cases).

Future improvements

By moving asset and slug compilation out of our app’s maintenance window, we can significantly shorten it. Combined with only entering maintenance mode when necessary, this lets us deploy frequently with minimal impact on our users.

We might be able to go a step further and use preboot when maintenance mode is not needed to eliminate any interruption in serving requests. We might also adopt a process for incremental migrations to avoid ever needing to enter maintenance mode.

In my opinion, the sort of slug precompilation described above should be a feature of the Heroku platform itself. Given an API with finer-grained control of the code push, slug compilation, and slug promotion cycle, we could perform fast deploys without the complexity of managing multiple apps.

How much of an interruption do you tolerate on your production apps, and does that outweigh the convenience of the Heroku platform?