On 2026-06-05, Reducto experienced elevated latency and job failures for document processing requests in our US production environment. The primary impact window was 07:24-08:25 PT.
Customers saw requests hang for longer than normal, and some jobs failed with server-side errors. Reducto on-call responded during the incident and increased processing capacity. The original pending-work alert resolved at 08:15 PT, and customer-facing service was considered recovered at 08:20 PT.
Across customer production traffic in the primary window, excluding internal/test traffic, we measured:
| Metric | Count |
|---|---|
| Submitted jobs | 112,374 |
| Server-error failed jobs | 9,576 |
| Jobs lasting >=10 min | 29,977 |
| Jobs lasting >=20 min | 11,297 |
| Server-error failed or >=20 min | 19,040 jobs across 178 customer orgs |
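
The counts in the table can be turned into shares of submitted jobs. The following is a minimal Python sketch of that arithmetic. It assumes every row counts the same population of customer production jobs and that the last row is the union of the "server-error failed" and ">=20 min" rows. The variable names are illustrative and do not come from any Reducto system.

```python
# Counts copied from the table above (primary impact window, customer production traffic).
submitted = 112_374
server_error_failed = 9_576
lasting_10_min_or_more = 29_977
lasting_20_min_or_more = 11_297
failed_or_20_min_or_more = 19_040


def pct(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


failure_rate = pct(server_error_failed, submitted)
slow_10_rate = pct(lasting_10_min_or_more, submitted)
slow_20_rate = pct(lasting_20_min_or_more, submitted)
affected_rate = pct(failed_or_20_min_or_more, submitted)

# Inclusion-exclusion: |A and B| = |A| + |B| - |A or B|.
# Assumes the last table row is exactly the union of the two sets.
failed_and_slow = server_error_failed + lasting_20_min_or_more - failed_or_20_min_or_more

print(f"server-error failure rate: {failure_rate}%")
print(f"jobs lasting >=10 min:     {slow_10_rate}%")
print(f"jobs lasting >=20 min:     {slow_20_rate}%")
print(f"failed or >=20 min:        {affected_rate}%")
print(f"failed and >=20 min:       {failed_and_slow} jobs")
```

Under those assumptions, about 8.5% of submitted jobs failed with server errors, about 16.9% either failed or lasted 20 minutes or longer, and 1,833 jobs fall into both categories.
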

| Time (PT) | Event |
|---|---|
| 07:24 | Reducto detected elevated pending work in the US processing system. |
| 07:25 | Reducto on-call engaged. |
| 07:49 | Investigation confirmed that most queued work was document parsing work. |
| 07:56 | Reducto increased downstream processing capacity. |
| 07:59 | Parse throughput began recovering. |
| 08:08 | During backlog drain, some jobs began failing against OCR dependencies. |
| 08:15 | The original pending-work alert resolved. |
| 08:20 | Customer-facing service was considered recovered. |

The incident was caused by a sudden spike in document-processing work combined with insufficient effective compute capacity in part of our inference stack. We had cut over to a new inference server yesterday, and that server required more CPU under load. Under the spike, the server experienced cross-model contention and the infrastructure provider could not consistently supply the extra CPU. Model inference slowed down while the queue continued to grow.
As the backlog built up, some requests waited a long time before processing and eventually failed or timed out. During recovery, the system drained the backlog quickly, which created a secondary burst of OCR traffic. That burst hit OCR-provider limits and caused additional server-side failures before the system fully stabilized.
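
The secondary failure mode described above (a fast backlog drain producing a burst that exceeds a provider limit) can be illustrated with a small simulation. This is a hedged sketch, not Reducto's system or its remediation: the backlog size, worker capacity, and OCR limit are hypothetical numbers chosen for illustration, and `TokenBucket` and `drain` are hypothetical names. It compares an unpaced drain with one paced by a token bucket, a common rate-limiting technique.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TokenBucket:
    """Simple token bucket: refills at rate_per_sec, holds at most capacity tokens."""
    rate_per_sec: float
    capacity: float
    tokens: float = 0.0

    def refill(self, elapsed_sec: float) -> None:
        self.tokens = min(self.capacity, self.tokens + self.rate_per_sec * elapsed_sec)

    def try_take(self) -> bool:
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def drain(backlog: int, worker_capacity_per_sec: int,
          bucket: Optional[TokenBucket], seconds: int) -> List[int]:
    """Return the number of OCR calls sent in each one-second tick."""
    sent_per_tick = []
    for _ in range(seconds):
        if bucket is not None:
            bucket.refill(1.0)
        sent = 0
        while backlog > 0 and sent < worker_capacity_per_sec:
            if bucket is not None and not bucket.try_take():
                break  # Out of tokens: leave the rest of the backlog for later ticks.
            backlog -= 1
            sent += 1
        sent_per_tick.append(sent)
    return sent_per_tick


# Hypothetical numbers, for illustration only.
BACKLOG = 1000          # queued jobs that each need one OCR call
WORKER_CAPACITY = 400   # OCR calls per second the workers can issue after scale-up
OCR_LIMIT = 50          # OCR calls per second the provider accepts

unpaced = drain(BACKLOG, WORKER_CAPACITY, None, seconds=5)
paced = drain(BACKLOG, WORKER_CAPACITY,
              TokenBucket(rate_per_sec=OCR_LIMIT, capacity=OCR_LIMIT), seconds=5)

print("unpaced calls per second:", unpaced)
print("paced calls per second:  ", paced)
```

In this toy model the unpaced drain sends 400 calls in its first second, well above the hypothetical limit of 50 per second, while the paced drain never exceeds the limit at the cost of a longer drain time.
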
We found no evidence that a full production deploy immediately before the impact window triggered this incident.