High-Throughput Distributed Bulk Import Platform

An engineering case study on redesigning a resource-constrained, single-process CSV import path into a horizontally scalable, queue-driven distributed ingestion system supporting 100K+ records per import with fair multi-tenant scheduling.


1. Executive Summary

A core data-ingestion path allowed users to upload CSV files of records into the platform. The original implementation processed entire uploads synchronously inside a user-facing Edge Service, which capped practical import size at roughly 3–4K records and produced CPU/memory spikes that degraded stability for all tenants — not just the user running the import.

I owned the redesign end to end: decomposing the import path into a distributed, queue-based background processing system built on streaming CSV ingestion, object-storage-backed batch durability, a priority queue, and a horizontally scalable worker pool — with explicit fairness scheduling so large imports could not starve small ones.

Key results:

Dimension | Before | After
Max import size (practical) | ~3–4K records | 100K+ records (over 20×)
Processing throughput | Baseline | ~5× faster
Resource behavior | CPU/memory spikes, shared-tenant impact | Isolated, bounded, stable
Fairness | Large files starved small ones | Priority + round-robin, no starvation
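
To make the fairness result concrete, here is a minimal sketch of the round-robin half of the scheduling idea, assuming each import has already been split into batches. The function name round_robin and the batch labels are hypothetical, and the priority tiers are omitted; the production scheduler's actual policy is not detailed here.

```python
from collections import deque
from typing import Iterator


def round_robin(imports: dict[str, list[str]]) -> Iterator[tuple[str, str]]:
    """Yield (import_id, batch) pairs, taking one batch per import per turn."""
    # Skip empty imports so every queued entry has at least one batch.
    active = deque(
        (import_id, deque(batches))
        for import_id, batches in imports.items()
        if batches
    )
    while active:
        import_id, batches = active.popleft()
        yield import_id, batches.popleft()
        if batches:
            # Unfinished imports go to the back of the line.
            active.append((import_id, batches))


# A large import queued first does not block a small one queued after it.
imports = {"large": ["L0", "L1", "L2", "L3"], "small": ["S0"]}
order = [batch for _, batch in round_robin(imports)]
print(order)  # ['L0', 'S0', 'L1', 'L2', 'L3']
```

The small import completes on the second turn instead of waiting behind every batch of the large one.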

A reader should walk away understanding that this was not a feature — it was a system redesign spanning architecture, concurrency, multi-tenant fairness, data consistency, and capacity planning.


2. Business Problem

The platform lets users bulk-import records via CSV. As adoption grew, import sizes grew with it — but the ingestion path had been built for small files and never re-architected for scale.

Why the system existed. Bulk import is a primary onboarding and daily-use workflow. If it is slow, fragile, or capped, users cannot get their data into the platform, which directly blocks the value they came for.

The core engineering problem. All parsing, validation, and persistence happened inline inside the Edge Service — the same tier serving live, latency-sensitive user requests. This created three compounding failure modes:

  1. Resource contention as a shared-fate problem. A single large import spiked CPU and memory on a request-serving node, degrading unrelated users’ experience. The blast radius of one user’s import was the entire node.
  2. A hard ceiling on import size. Holding and processing a whole file in one process meant memory grew with file size. Beyond ~3–4K records, the process became unstable.
  3. No load distribution. Single-process, single-node execution meant there was no way to spread work; the only “scaling” lever was praying the file was small.
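
The size ceiling in item 2 comes from holding the whole file in memory at once. A minimal sketch of the alternative, reading the CSV lazily and emitting fixed-size batches, might look like the following. The function name iter_batches and the batch size of 1,000 are illustrative assumptions, not the platform's actual values.

```python
import csv
import io
from itertools import islice
from typing import Iterable, Iterator


def iter_batches(
    lines: Iterable[str], batch_size: int = 1000
) -> Iterator[list[dict[str, str]]]:
    """Yield lists of at most batch_size parsed rows, reading lazily."""
    reader = csv.DictReader(lines)
    while True:
        # islice pulls only batch_size rows from the reader per iteration,
        # so peak memory tracks the batch size, not the file size.
        batch = list(islice(reader, batch_size))
        if not batch:
            return
        yield batch


# A 2,500-row upload becomes three bounded batches.
sample = io.StringIO("id,name\n" + "".join(f"{i},user{i}\n" for i in range(2500)))
batch_sizes = [len(batch) for batch in iter_batches(sample, batch_size=1000)]
print(batch_sizes)  # [1000, 1000, 500]
```

Each batch is an independent unit of work, which is what makes it possible to hand batches to separate workers later.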

Why it was difficult. This is not “make it async and walk away.” Moving to background processing introduces a new set of hard problems: durability of in-flight work, multi-tenant fairness, idempotency on retry, data consistency under concurrent writes, and capacity planning for a worker fleet. The redesign had to preserve correctness (no dropped or duplicated records) while removing the size ceiling and the shared-fate instability.
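
Idempotency on retry is the least obvious of these. One common approach, sketched below under the assumption that each batch can be identified by its import ID and position, is to derive a deterministic key per batch and refuse to apply the same key twice. InMemoryStore and batch_key are hypothetical stand-ins; a real store would need to make the key check and the row writes a single transaction.

```python
import hashlib


def batch_key(import_id: str, batch_index: int) -> str:
    """Deterministic key: the same batch always maps to the same key."""
    return hashlib.sha256(f"{import_id}:{batch_index}".encode()).hexdigest()


class InMemoryStore:
    """Stand-in for a database that records which batches were applied."""

    def __init__(self) -> None:
        self.applied: set[str] = set()
        self.rows: list[dict[str, str]] = []

    def apply_batch(self, key: str, rows: list[dict[str, str]]) -> bool:
        """Apply rows once per key; return False if the key was already seen."""
        if key in self.applied:
            return False
        self.rows.extend(rows)
        self.applied.add(key)
        return True


store = InMemoryStore()
key = batch_key("import-42", 0)
first = store.apply_batch(key, [{"id": "1"}, {"id": "2"}])
retry = store.apply_batch(key, [{"id": "1"}, {"id": "2"}])  # redelivered batch
print(first, retry, len(store.rows))  # True False 2
```

With this shape, a worker that crashes after writing but before acknowledging the queue message can safely receive the batch again without duplicating records.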

Scale & constraints. Target: a single import of 100K+ records, multiple tenants importing concurrently, with per-account quota enforcement and strong consistency on account-level usage counters. All of this had to run on production-like infrastructure with bounded, predictable resource cost.
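
Quota enforcement under concurrency reduces to making the check and the increment one atomic step. The sketch below illustrates this in a single process with a lock; AccountQuota and try_reserve are hypothetical names, and in a distributed deployment the same guarantee would have to come from the datastore (for example, a conditional update) rather than an in-process lock.

```python
import threading


class AccountQuota:
    """Per-account usage counter with an atomic check-and-reserve."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.used = 0
        self._lock = threading.Lock()

    def try_reserve(self, count: int) -> bool:
        """Reserve count records if they fit under the limit."""
        with self._lock:
            # Check and increment happen under the same lock, so two
            # concurrent imports cannot both pass the check and overshoot.
            if self.used + count > self.limit:
                return False
            self.used += count
            return True


quota = AccountQuota(limit=100_000)
results: list[bool] = []


def reserve() -> None:
    results.append(quota.try_reserve(30_000))


# Five concurrent imports of 30K records each against a 100K quota.
threads = [threading.Thread(target=reserve) for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(results), quota.used)  # [False, False, True, True, True] 90000
```

Exactly three reservations succeed regardless of thread interleaving, and the counter never exceeds the limit.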