Outages and Downtime

Vaccine scheduling systems must be available when people need them. To enable this, we recommend that these systems adopt the best practices of Site Reliability Engineering (SRE), detailed below, with an eye to specific issues we have come across in evaluating various jurisdictions' solutions.

Ways to use these recommendations:

  • For jurisdictions seeking new systems, the concerns below, in conjunction with the rest of this guide, should be part of the initial considerations around which system to acquire or switch to

  • For jurisdictions that have an ongoing relationship with a particular provider, the concerns below can inform the recommendations and requests that the jurisdiction makes of its existing provider

SRE best practices mean that these services should maintain high service availability; engage in demand forecasting and capacity planning in order to respond gracefully to surges in traffic; adopt robust change management; allow for seamless system upgrades with minimal to no downtime; and log, track, and seek to minimize Mean Time To Repair (MTTR).
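To make the last point concrete, the sketch below shows one way a provider might compute Mean Time To Repair and availability from a simple outage log. It is a minimal illustration, not a prescribed method: the function names, the (start, end) log format, and the sample outages are all hypothetical, and the code assumes the outages do not overlap and fall entirely within the reporting window.

```python
from datetime import datetime, timedelta

def total_downtime(outages):
    """Sum the durations of (start, end) outage intervals.

    Assumption: intervals do not overlap.
    """
    return sum((end - start for start, end in outages), timedelta(0))

def mean_time_to_repair(outages):
    """Average outage duration; zero if there were no outages."""
    if not outages:
        return timedelta(0)
    return total_downtime(outages) / len(outages)

def availability(outages, window_start, window_end):
    """Fraction of the reporting window during which the service was up.

    Assumption: every outage lies inside [window_start, window_end].
    """
    window = window_end - window_start
    return 1 - total_downtime(outages) / window

# Hypothetical sample data: two outages in a 30-day reporting window.
window_start = datetime(2021, 3, 1)
window_end = window_start + timedelta(days=30)
outages = [
    (datetime(2021, 3, 4, 9, 0), datetime(2021, 3, 4, 9, 30)),     # 30 minutes
    (datetime(2021, 3, 18, 14, 0), datetime(2021, 3, 18, 15, 30)),  # 90 minutes
]

mttr = mean_time_to_repair(outages)
uptime_fraction = availability(outages, window_start, window_end)
```

Tracking these two numbers over time gives a jurisdiction a simple basis for conversations with its provider: a rising MTTR or falling availability is a signal to ask what changed.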

One important framing for all of these recommendations is the "Service Reliability Hierarchy" described in the Google SRE Book. We organize our recommendations along this pyramid, moving from the most basic and likely easiest changes at the base to the most advanced and perhaps most difficult to implement at the peak.

Below is the list of prioritized concerns to investigate, discuss, and where relevant, address:
