T4 Inference and Trainings, and Registry outage

Resolved · Full outage

Everything impacted has returned to normal operation.

Thu, Jun 12, 2025, 09:57 PM

Affected components

Jun 12, 2025, 06:41 PM to 09:45 PM

Replicate Registry (r8.im)

Official Models

Updates

Resolved

Everything impacted has returned to normal operation.

Thu, Jun 12, 2025, 09:57 PM

Monitoring

All T4 hardware class instances have recovered to the same functional level as before the start of this incident.

Thank you for your patience while we worked through the issues.

Thu, Jun 12, 2025, 09:45 PM (12 minutes earlier)

Monitoring

Most T4 hardware-based models are running as expected. We are still seeing some in a crashed/error state. The bulk of the unscheduled instances have now been scheduled, and inference, training, and scaling have generally recovered.

Thu, Jun 12, 2025, 09:35 PM

Monitoring

We are seeing recovery of the impacted Official Models.

We are still monitoring the T4 instances, though T4 usability for inference and trainings is already improving.

Thu, Jun 12, 2025, 09:32 PM

Monitoring

We are monitoring a slow recovery of impacted services. At this time, the Registry (r8.im) is back online and accepting new images. We will provide updates as soon as we see restoration of the impacted official models and of T4 hardware inference and training.

Thu, Jun 12, 2025, 09:23 PM

Identified

We are aware of a number of outages impacting our upstream providers. We are continuing to monitor the situation and associated issues.

At this time, the impact on inference and training appears to remain limited to the T4 hardware class.

We are also aware that it is not possible to push new Cog images to the "r8.im" registry. Images that are not cached in our downstream regions will not be able to boot at this time.

Additionally, a small number of official models are seeing a high incidence of failed predictions.

Thu, Jun 12, 2025, 07:34 PM (1 hour earlier)

Identified

Thu, Jun 12, 2025, 07:29 PM

Investigating

We are currently experiencing an outage with T4 hardware. All models that rely on this hardware have degraded performance.

Thu, Jun 12, 2025, 06:41 PM (48 minutes earlier)