An engineering case study on redesigning a resource-constrained, single-process CSV import path into a horizontally scalable, queue-driven distributed ingestion system supporting 100K+ records per import with fair multi-tenant scheduling.
A core data-ingestion path allowed users to upload CSV files of records into the platform. The original implementation processed entire uploads synchronously inside a user-facing Edge Service, which capped practical import size at roughly 3–4K records and produced CPU/memory spikes that degraded stability for all tenants — not just the user running the import.
I owned the redesign end to end: decomposing the import path into a distributed, queue-based background processing system built on streaming CSV ingestion, object-storage-backed batch durability, a priority queue, and a horizontally scalable worker pool — with explicit fairness scheduling so large imports could not starve small ones.
Key results:
| Dimension | Before | After |
|---|---|---|
| Max import size (practical) | ~3K records | 100K+ records (20×) |
| Processing throughput | baseline | ~5× faster |
| Resource behavior | CPU/memory spikes, shared-tenant impact | Isolated, bounded, stable |
| Fairness | Large files starved small ones | Priority + round-robin, no starvation |
A reader should walk away understanding that this was not a feature — it was a system redesign spanning architecture, concurrency, multi-tenant fairness, data consistency, and capacity planning.
The platform lets users bulk-import records via CSV. As adoption grew, import sizes grew with it — but the ingestion path had been built for small files and never re-architected for scale.
Why the system existed. Bulk import is a primary onboarding and daily-use workflow. If it is slow, fragile, or capped, users cannot get their data into the platform, which directly blocks the value they came for.
The core engineering problem. All parsing, validation, and persistence happened inline inside the Edge Service — the same tier serving live, latency-sensitive user requests. This created three compounding failure modes:
Why it was difficult. This is not “make it async and walk away.” Moving to background processing introduces a new set of hard problems: durability of in-flight work, multi-tenant fairness, idempotency on retry, data consistency under concurrent writes, and capacity planning for a worker fleet. The redesign had to preserve correctness (no dropped or duplicated records) while removing the size ceiling and the shared-fate instability.
Scale & constraints. Target: a single import of 100K+ records, multiple tenants importing concurrently, with per-account quota enforcement and strong consistency on account-level usage counters. All of this on production-like infrastructure with bounded, predictable resource cost.