Resolved
Everything impacted has returned to normal operation.
Monitoring
All T4 hardware class instances have recovered to the same functional level as before the start of this incident.
Thank you for your patience while we worked through the issues.
Monitoring
Most T4 hardware-based models are running as expected. Some are still in a crashed or error state. Most of the previously unscheduled instances have now been scheduled, and inference, training, and scaling have generally recovered.
Monitoring
We are seeing recovery of the impacted official models.
We are still monitoring the T4 instances, though the usability of T4 for inference and training is already improving.
Monitoring
We are monitoring a slow recovery of impacted services. At this time, the registry (r8.im) is back online and accepting new images. We will provide updates as soon as we see restoration of the impacted official models and of T4 hardware inference and training.
Identified
We are aware of a number of outages impacting our upstream providers. We are continuing to monitor the situation and associated issues.
At this time, the impact on inference and training appears to be limited to the T4 hardware class.
We are also aware that it is not possible to push new Cog images to the "r8.im" registry. Images that are not cached in our downstream regions will not be able to boot at this time.
Additionally, a small number of official models are seeing a high incidence of failed predictions.
Identified
We are aware of a number of outages impacting our upstream providers. We are continuing to monitor the situation and associated issues.
At this time, the impact on inference and training appears to be limited to the T4 hardware class.
We are also aware that it is not possible to push new Cog images to the "r8.im" registry. Images that are not cached in our downstream regions will not be able to boot at this time.
Investigating
We are currently experiencing an outage with T4 hardware. All models that rely on this hardware have degraded performance.