Web outage - Replicate Status

Write-up

Thank you for your patience as we work through our post-incident process. During the roughly 31 minutes between identification and remediation of the incident, users likely experienced at least one of the following:

- Web page timeouts and 500-level errors

- API timeouts and 500-level errors

- Delayed predictions

The problematic deploy that we rolled back in response to the incident contained a seemingly innocuous change to the usage of a Django queryset. As a result of this change, a very expensive query could run many times per request, which dramatically increased database load and degraded all other database activity.
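The write-up does not say exactly what the queryset change was, so the following is only an illustrative sketch of how this class of bug can arise, not a reconstruction of the actual code. It uses the standard-library `sqlite3` module and a hypothetical `LazyQuery` class that mimics one well-known queryset behavior: calling `count()` on a queryset that has not been evaluated issues a fresh query every time. The `predictions` table, `CountingConnection`, `render_slow`, and `render_fast` are all assumptions made up for the example.

```python
import sqlite3


class CountingConnection:
    """Hypothetical wrapper that counts statements sent through execute()."""

    def __init__(self):
        self.conn = sqlite3.connect(":memory:")
        self.query_count = 0

    def execute(self, sql, params=()):
        self.query_count += 1
        return self.conn.execute(sql, params)


class LazyQuery:
    """Stand-in for a lazy queryset (an assumption, not Django's implementation).

    Iterating fills a result cache; count() only reuses that cache if it is
    already filled, otherwise it hits the database on every call.
    """

    def __init__(self, db, sql):
        self.db = db
        self.sql = sql
        self._cache = None

    def __iter__(self):
        if self._cache is None:
            self._cache = self.db.execute(self.sql).fetchall()
        return iter(self._cache)

    def count(self):
        if self._cache is not None:
            return len(self._cache)
        # No cached rows yet: every call runs a new COUNT query.
        row = self.db.execute(f"SELECT COUNT(*) FROM ({self.sql})").fetchone()
        return row[0]


def make_db(rows=5):
    """Create an in-memory database with a hypothetical predictions table."""
    db = CountingConnection()
    # Setup goes through the raw connection so it is not counted.
    db.conn.execute("CREATE TABLE predictions (id INTEGER PRIMARY KEY)")
    db.conn.executemany(
        "INSERT INTO predictions (id) VALUES (?)", [(i,) for i in range(rows)]
    )
    return db


def render_slow(db, items):
    """Calls count() inside the loop: one query per item."""
    query = LazyQuery(db, "SELECT id FROM predictions")
    return [(item, query.count()) for item in items]


def render_fast(db, items):
    """Hoists count() out of the loop: one query in total."""
    query = LazyQuery(db, "SELECT id FROM predictions")
    total = query.count()
    return [(item, total) for item in items]
```

Both functions return the same data, and in review the only visible difference is where `count()` is called. With an expensive query and a large `items` list, the slow variant multiplies the database load by the length of the list.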

In response to how this incident progressed, we have updated our incident response process to reduce confusion and delays. We have also updated our code review and CI processes to help prevent future instances of this type of unbounded query.
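The write-up does not describe the specific CI changes, so the sketch below is only one possible shape for such a guard: a test helper that fails when a block of code issues more queries than an agreed budget. It is written against the standard-library `sqlite3` module so it runs on its own; `query_budget` and `QueryBudgetExceeded` are hypothetical names. In a Django test suite, the built-in `assertNumQueries` assertion serves a similar purpose.

```python
import sqlite3
from contextlib import contextmanager


class QueryBudgetExceeded(AssertionError):
    """Raised when a block runs more statements than its budget allows."""


@contextmanager
def query_budget(conn, limit):
    """Fail if more than `limit` SQL statements run inside the with-block.

    Uses sqlite3's trace callback, which receives each executed statement.
    """
    executed = []
    conn.set_trace_callback(executed.append)
    try:
        yield executed
    finally:
        # Always detach the callback, even if the block raised.
        conn.set_trace_callback(None)
    if len(executed) > limit:
        raise QueryBudgetExceeded(
            f"{len(executed)} statements executed, budget was {limit}"
        )


def make_connection():
    """In-memory database in autocommit mode, so no implicit BEGIN is traced."""
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("CREATE TABLE predictions (id INTEGER PRIMARY KEY)")
    return conn
```

A test would wrap the code under test, for example `with query_budget(conn, 2): handle_request(conn)`, where `handle_request` is whatever function the test exercises. A change that starts issuing a query per loop iteration then fails in CI rather than in production.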