Summary

On 2026-06-05, Reducto experienced elevated latency and job failures for document processing requests in our US production environment. The primary impact window was 07:24-08:25 PT.

Customers saw requests take longer than normal to complete, and some jobs failed with server-side errors. Reducto on-call responded during the incident and increased processing capacity. The pending-work backlog returned to baseline by approximately 08:15 PT, and customer-facing service was considered recovered at 08:20 PT.

Impact

Across customer production traffic in the primary window, excluding internal/test traffic, we measured:

Metric                             Count
Submitted jobs                     112,374
Server-error failed jobs           9,576
Jobs lasting >=10 min              29,977
Jobs lasting >=20 min              11,297
Server-error failed or >=20 min    19,040 (across 178 customer orgs)
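
To put these counts in proportion, the short sketch below derives each metric as a share of submitted jobs. It uses only the numbers in the table above. The overlap figure assumes the last row counts distinct jobs that either failed with a server error or lasted 20 minutes or more, so that inclusion-exclusion applies. The variable and function names are illustrative.

```python
# Counts copied from the impact table above.
submitted = 112_374
server_error_failed = 9_576
lasting_10_min_plus = 29_977
lasting_20_min_plus = 11_297
failed_or_20_min_plus = 19_040  # assumed to be a count of distinct jobs


def share_pct(count: int, total: int = submitted) -> float:
    """Return count as a percentage of total, rounded to two decimals."""
    return round(100 * count / total, 2)


# Inclusion-exclusion: |A and B| = |A| + |B| - |A or B|.
# Jobs that both lasted >=20 min and ended in a server error.
failed_and_20_min_plus = (
    server_error_failed + lasting_20_min_plus - failed_or_20_min_plus
)

for label, count in [
    ("Server-error failed jobs", server_error_failed),
    ("Jobs lasting >=10 min", lasting_10_min_plus),
    ("Jobs lasting >=20 min", lasting_20_min_plus),
    ("Server-error failed or >=20 min", failed_or_20_min_plus),
]:
    print(f"{label}: {share_pct(count)}% of submitted jobs")

print(f"Failed and >=20 min (derived): {failed_and_20_min_plus} jobs")
```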

Timeline

Time (PT)    Event
07:24        Reducto detected elevated pending work in the US processing system.
07:25        Reducto on-call engaged.
07:49        Investigation confirmed that most queued work was document parsing work.
07:56        Reducto increased downstream processing capacity.
07:59        Parse throughput began recovering.
08:08        During backlog drain, some jobs began failing against OCR dependencies.
08:15        The original pending-work alert resolved.
08:20        Customer-facing service was considered recovered.

Root cause

The incident was caused by a sudden spike in document-processing work combined with insufficient effective compute capacity in part of our inference stack. We had cut over to a new inference server the previous day, and it required more CPU under load. During the spike, the server experienced cross-model contention, and our infrastructure provider could not consistently supply the additional CPU. As a result, model inference slowed down while the queue continued to grow.

As the backlog built up, some requests waited a long time before processing and eventually failed or timed out. During recovery, the system drained the backlog quickly, which created a secondary burst of OCR traffic. That burst hit OCR-provider limits and caused additional server-side failures before the system fully stabilized.
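
The secondary burst described above is a general pattern: a backlog that drains as fast as possible can exceed a downstream dependency's rate limit. The sketch below illustrates one common way to bound drain rate, a token bucket, using a simulated clock. It is a generic illustration and does not describe Reducto's implementation. The class name, function name, and the 50 requests-per-second limit are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    """Generic token bucket; all parameters here are hypothetical."""
    rate_per_sec: float  # sustained requests per second allowed
    capacity: float      # maximum burst size
    tokens: float = 0.0
    last: float = 0.0    # timestamp of the last refill, in seconds

    def try_acquire(self, now: float) -> bool:
        # Refill in proportion to elapsed time, capped at capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate_per_sec)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def simulate_drain(backlog: int, bucket: TokenBucket, seconds: int) -> list[int]:
    """Return how many queued requests are released in each simulated second."""
    sent_per_second = []
    for second in range(seconds):
        now = float(second + 1)  # simulated clock, one step per second
        sent = 0
        # Release queued work only while the bucket still has tokens.
        while backlog > 0 and bucket.try_acquire(now):
            backlog -= 1
            sent += 1
        sent_per_second.append(sent)
    return sent_per_second


# A 120-request backlog drained under a hypothetical 50 req/s limit.
print(simulate_drain(120, TokenBucket(rate_per_sec=50.0, capacity=50.0), 5))
```

With a limiter like this, the backlog drains over several seconds instead of being released at once, trading a slightly longer recovery for fewer rejected downstream calls.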

We did not find evidence that this incident was triggered by a recent full production deploy immediately before the impact window.

What we're doing about it

What you can do