Vaccine Provider Guide
Outages and Downtime

Last updated 4 years ago

Vaccine scheduling systems must be available when people need them. To enable this, we recommend that these systems adopt the best practices of Site Reliability Engineering (SRE), which we detail below with a mind to specific issues we have come across while evaluating various jurisdictions' solutions.

Ways to use these recommendations:

  • For jurisdictions seeking new systems, the concerns below, in conjunction with the rest of this guide, should be part of the initial considerations around which system to acquire or switch to.

  • For jurisdictions that have an ongoing relationship with a particular provider, the concerns below can inform the recommendations and requests that the jurisdiction makes of its existing provider.

SRE best practices mean that these services should maintain high service availability; engage in demand forecasting and capacity planning in order to respond gracefully to surges in traffic; adopt robust change management; allow for seamless system upgrades with minimal to no downtime; and log, track, and seek to minimize Mean Time To Repair (MTTR).
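As a rough illustration of what logging and tracking Mean Time To Repair can look like, the sketch below computes MTTR and availability from an incident log. The incident timestamps and the function names are hypothetical and chosen only for this example.

```python
from datetime import datetime, timedelta

# Hypothetical incident log: (outage detected, service restored).
incidents = [
    (datetime(2021, 3, 1, 9, 0), datetime(2021, 3, 1, 9, 45)),
    (datetime(2021, 3, 12, 14, 0), datetime(2021, 3, 12, 14, 15)),
]


def total_downtime(incidents):
    """Sum the duration of every incident."""
    return sum((end - start for start, end in incidents), timedelta(0))


def mean_time_to_repair(incidents):
    """Average time from detection to restoration."""
    if not incidents:
        return timedelta(0)
    return total_downtime(incidents) / len(incidents)


def availability(incidents, period):
    """Fraction of the reporting period during which the service was up."""
    return 1 - total_downtime(incidents) / period
```

Tracking these two numbers over time gives a jurisdiction something concrete to ask a vendor to report and improve.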

One important framing for all of these recommendations is the so-called "Service Reliability Hierarchy", summarized in the Google SRE Book. With this model in mind, we organize our recommendations along this pyramid, going from the most basic changes, which are likely the easiest to make (the base), to the most advanced and perhaps most difficult to implement (the peak).

Below is the list of prioritized concerns to investigate, discuss, and where relevant, address:

Question

Explanation

Are logging, monitoring, and alerting built in?

These should be standard features of these systems to ensure early problem detection:

  1. Logs to catch issues as they happen

  2. Monitoring to see whether the site is down or under heavy load

  3. Alerting to let the on-call person know right when there is an outage
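The three pieces above can be sketched together in a few lines. This is a minimal illustration, not a description of any vendor's system: the `HealthMonitor` class, the latency threshold, and the "three failed probes" rule are all assumptions made for the example, and `alert` stands in for whatever paging tool the on-call person uses.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("healthcheck")


class HealthMonitor:
    """Logs every probe result and alerts after repeated failures."""

    def __init__(self, alert, max_latency_s=2.0, failures_before_alert=3):
        self.alert = alert  # callable that notifies the on-call person
        self.max_latency_s = max_latency_s
        self.failures_before_alert = failures_before_alert
        self.consecutive_failures = 0

    def record(self, ok, latency_s):
        """Record one probe: `ok` is whether the site answered at all."""
        healthy = ok and latency_s <= self.max_latency_s
        if healthy:
            log.info("probe ok latency=%.2fs", latency_s)
            self.consecutive_failures = 0
            return
        self.consecutive_failures += 1
        log.warning(
            "probe failed ok=%s latency=%.2fs streak=%d",
            ok, latency_s, self.consecutive_failures,
        )
        # Alert once, when the failure streak first reaches the threshold.
        if self.consecutive_failures == self.failures_before_alert:
            self.alert(
                f"site unhealthy for {self.consecutive_failures} consecutive probes"
            )
```

A slow response counts as a failure here, so the same mechanism covers both "the site is down" and "the site is under heavy load".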

Does the site have a plan in place for when there is downtime?

In the unlikely event of downtime, these systems should:

  • Gracefully fail over to a minimal site

  • Gracefully disable non-functioning features

  • Ensure data integrity and avoid data corruption
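A minimal sketch of the first two behaviors, assuming each page feature is rendered by its own handler function. The `serve` function, the handler names, and the wording of the minimal page are hypothetical; data integrity is a separate concern that this sketch does not address.

```python
MINIMAL_PAGE = (
    "Online booking is temporarily unavailable. "
    "Please call your local clinic."
)


def serve(handlers, disabled):
    """Render each enabled feature, skipping any that is broken.

    `handlers` maps a feature name to a function returning its content.
    `disabled` is a set of feature names that should not be called; any
    feature that raises is added to it so later requests skip it.
    """
    sections = []
    for name, handler in handlers.items():
        if name in disabled:
            continue
        try:
            sections.append(handler())
        except Exception:
            disabled.add(name)  # stop calling the non-functioning feature
    # If nothing could be rendered, fall back to the minimal site.
    return "\n".join(sections) if sections else MINIMAL_PAGE
```

The point of the design is that one failing feature, such as a clinic map, should not take the appointment form down with it.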

Are there formalized uptime guarantees?

Does the contract guarantee a system uptime percentage? Such Service Level Agreement (SLA) clauses are standard in many agreements and are a good idea, especially when the contract formally defines uptime by naming the Service-Level Indicators (SLIs) to watch and the explicit Service-Level Objectives (SLOs) those indicators must meet. The SLA should also be accompanied by a guaranteed Mean Time to Repair, an on-call procedure, and support.
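To make an uptime percentage concrete, it helps to convert it into allowed downtime. The sketch below does that arithmetic and also shows one common way to express an SLI, as the share of successful requests. The function names and the 30-day period are assumptions for illustration; for example, a 99.9% objective over 30 days allows roughly 43.2 minutes of downtime.

```python
def allowed_downtime_minutes(slo, days=30):
    """Downtime permitted by an uptime objective over a period of `days`."""
    return (1 - slo) * days * 24 * 60


def sli(good_requests, total_requests):
    """Request-based indicator: the share of requests that succeeded."""
    if total_requests == 0:
        return 1.0  # no traffic means nothing failed
    return good_requests / total_requests


def error_budget_remaining(good_requests, total_requests, slo):
    """Failed requests still permitted before the objective is missed.

    A negative result means the objective has already been missed.
    """
    allowed_failures = (1 - slo) * total_requests
    actual_failures = total_requests - good_requests
    return allowed_failures - actual_failures
```

Framing the contract this way lets both parties check the same number instead of arguing over what "high availability" means.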

Has the vendor planned for the possibility of downtime?

  1. Vendors should engage in capacity planning and forecasting, with a mind to the unique traffic scenarios that this website represents.

  2. Systems should come equipped with an incident response plan, complete with an incident management process and a dedicated on-call rotation.

  3. During development, developers of these systems should adopt distributed load testing tools to simulate realistic high-traffic scenarios. Systems should not be rushed to the finish line without such robust testing infrastructure in place.
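The idea behind load testing can be sketched on a single machine. This is only a stand-in: real distributed load testing tools generate traffic from many machines at once, and the `load_test` function and its parameters are hypothetical. `request_fn` is any callable that performs one request and raises on failure.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def load_test(request_fn, total_requests=200, concurrency=20):
    """Call `request_fn` many times in parallel and summarize the results."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")

    def timed_call(_):
        start = time.perf_counter()
        try:
            request_fn()
            ok = True
        except Exception:
            ok = False
        return ok, time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(timed_call, range(total_requests)))

    latencies = sorted(latency for _, latency in results)
    errors = sum(1 for ok, _ in results if not ok)
    # Simple 95th percentile estimate: the value 95% of the way up the
    # sorted latencies, clamped to the last element.
    p95_index = min(len(latencies) - 1, int(0.95 * len(latencies)))
    return {
        "requests": total_requests,
        "error_rate": errors / total_requests,
        "p95_latency_s": latencies[p95_index],
    }
```

Running a test like this at several concurrency levels shows where error rates and latency start to climb, which is the input capacity planning needs.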

Is it easy to update the system?

  1. Can the system be updated without downtime? It should not be necessary to take a website down entirely to perform upgrades if it is using modern best-practice development techniques.

  2. Vendors should adopt best practices of release engineering, including continuous build and deployment and configuration management in order to release new versions easily and seamlessly.
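One common way to update without downtime is a rolling update, in which servers are upgraded one at a time while the others keep serving. The sketch below simulates that logic under simple assumptions: each instance is a dictionary with hypothetical `version` and `serving` fields, `is_healthy` is a caller-supplied check, and there are at least two instances so that one is always serving.

```python
def rolling_update(instances, new_version, is_healthy):
    """Upgrade instances one at a time, stopping at the first failure.

    Returns True if every instance was upgraded. On a failed health
    check the affected instance is rolled back, the update stops, and
    the function returns False; instances upgraded earlier keep the
    new version.
    """
    for instance in instances:
        old_version = instance["version"]
        instance["serving"] = False  # drain; other instances keep serving
        instance["version"] = new_version
        if not is_healthy(instance):
            instance["version"] = old_version  # roll back this instance
            instance["serving"] = True
            return False
        instance["serving"] = True
    return True
```

Because only one instance is out of service at any moment, users should see no interruption as long as the remaining instances can carry the traffic.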

Is it robust to intentional attacks / bots?

In addition to handling high legitimate load, these systems should implement DoS / DDoS protection, by means of IP blocking and other similar tools, to keep malicious actors and bots from overwhelming the site.
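As an illustration of IP blocking, the sketch below blocks any address that sends too many requests within a short window. The `IpRateLimiter` class and its thresholds are assumptions for this example, and production systems usually rely on a dedicated protection service rather than application code like this.

```python
from collections import defaultdict, deque


class IpRateLimiter:
    """Blocks an IP once it exceeds `max_requests` within `window_s` seconds."""

    def __init__(self, max_requests=10, window_s=60.0):
        self.max_requests = max_requests
        self.window_s = window_s
        self.hits = defaultdict(deque)  # ip -> timestamps of recent requests
        self.blocked = set()

    def allow(self, ip, now):
        """Return True if the request at time `now` (seconds) may proceed."""
        if ip in self.blocked:
            return False
        window = self.hits[ip]
        # Forget requests that fell out of the sliding window.
        while window and now - window[0] >= self.window_s:
            window.popleft()
        if len(window) >= self.max_requests:
            self.blocked.add(ip)  # stays blocked until removed manually
            return False
        window.append(now)
        return True
```

A block that never expires is deliberately simple; a real deployment would want expiry and an allowlist so that clinics sharing one address are not locked out.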

Is the system cloud-based?

  1. A scalable, cloud-native solution is preferred, to be able to handle frequent traffic surges through failover, load-balancing, and other strategies.

  2. Note that just being cloud-based is often not sufficient in and of itself - the tool should adopt proper capacity planning and use cloud resources in a way that allows for rapid and frequent traffic surges: it should utilize multiple geographic availability zones, load balancers, and other modern practices to ensure that the impact of a surge in traffic is limited.
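The combination of load balancing and failover across zones can be sketched as round-robin selection that skips unhealthy servers. The `ZoneBalancer` class, the zone names, and the addresses are hypothetical; cloud providers offer managed load balancers that do this, along with health checking, on the customer's behalf.

```python
import itertools


class ZoneBalancer:
    """Round-robin over backends in several zones, skipping unhealthy ones."""

    def __init__(self, backends):
        # Each backend is a (zone, address) tuple.
        self.backends = list(backends)
        self.healthy = set(self.backends)
        self._cycle = itertools.cycle(self.backends)

    def mark_down(self, backend):
        self.healthy.discard(backend)

    def mark_up(self, backend):
        if backend in self.backends:
            self.healthy.add(backend)

    def pick(self):
        """Return the next healthy backend, or raise if none is left."""
        # One full pass over the cycle visits every backend once.
        for _ in range(len(self.backends)):
            backend = next(self._cycle)
            if backend in self.healthy:
                return backend
        raise RuntimeError("no healthy backends")
```

If every server in one zone is marked down, traffic continues to flow to the other zone, which is the failure isolation that multiple availability zones are meant to provide.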
