Tenant fairness on shared inference

The pressure case

Shared inference should not let one noisy tenant turn the rest of the system into a queue of apologies. KVWarden Gate 2 tests a simple fairness question: when a flooder pushes load into the scheduler, does a quiet tenant still get service that resembles solo performance?

The current run holds the quiet tenant's time to first token (TTFT) at 61.5 ms under flooder pressure, against 53.9 ms in the solo case: 7.6 ms slower, or about 14%. FIFO does not hold that line. In this setup KVWarden is 26x better than FIFO at protecting the quiet tenant.
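
The arithmetic behind the quiet-tenant comparison can be sketched in a few lines. The `degradation` helper is a hypothetical name introduced here for illustration, not part of the Gate 2 harness; only the two TTFT figures come from the run above. The FIFO TTFT is not quoted in this post, so the 26x figure is not recomputed.

```typescript
// TTFT figures quoted in the text, in milliseconds.
const soloTtftMs = 53.9;
const contendedTtftMs = 61.5;

// Hypothetical helper: how much slower a tenant is under contention,
// as an absolute delta and as a ratio to its solo latency.
function degradation(soloMs: number, contendedMs: number): { deltaMs: number; ratio: number } {
  if (soloMs <= 0) throw new RangeError("solo latency must be positive");
  return { deltaMs: contendedMs - soloMs, ratio: contendedMs / soloMs };
}

const quiet = degradation(soloTtftMs, contendedTtftMs);
console.log(`delta = ${quiet.deltaMs.toFixed(1)} ms, ratio = ${quiet.ratio.toFixed(2)}x`);
// prints "delta = 7.6 ms, ratio = 1.14x"
```

Running the same helper on a FIFO run's quiet-tenant TTFT would give the number to set beside 1.14x.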

What changed

KVWarden separates admission, token budget, and eviction pressure into explicit accounting. The scheduler does not treat every arrival as equal once tenant behavior diverges. It watches the load shape and spends the shared cache with a bias toward bounded harm.

type TenantBudget = {
  tenantId: string;
  tokensInFlight: number;
  cachePressure: number;
  lastServedAt: number;
};
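
As a minimal sketch of how a record like this could drive a scheduling choice, the example below picks the next tenant to serve. The policy (fewest tokens in flight first, ties broken by longest wait), the `pickNext` function, and the `MAX_TOKENS_IN_FLIGHT` cap are all assumptions made for illustration; KVWarden's actual admission logic is not described here and may differ.

```typescript
type TenantBudget = {
  tenantId: string;
  tokensInFlight: number;
  cachePressure: number;
  lastServedAt: number;
};

// Hypothetical cap; the source gives no actual limit.
const MAX_TOKENS_IN_FLIGHT = 4096;

// Assumed policy: skip tenants at or over the cap, then prefer the tenant
// with the fewest tokens in flight, breaking ties by the oldest lastServedAt.
// cachePressure is carried along but not used by this simple sketch.
function pickNext(budgets: TenantBudget[]): TenantBudget | undefined {
  const eligible = budgets.filter((b) => b.tokensInFlight < MAX_TOKENS_IN_FLIGHT);
  eligible.sort(
    (a, b) => a.tokensInFlight - b.tokensInFlight || a.lastServedAt - b.lastServedAt,
  );
  return eligible[0];
}

// Made-up values for illustration only.
const budgets: TenantBudget[] = [
  { tenantId: "flooder", tokensInFlight: 3900, cachePressure: 0.9, lastServedAt: 1000 },
  { tenantId: "quiet", tokensInFlight: 120, cachePressure: 0.1, lastServedAt: 400 },
];

const next = pickNext(budgets);
console.log(next?.tenantId); // prints "quiet"
```

Under this assumed policy a flooder cannot starve the quiet tenant simply by arriving first, which is the behavior FIFO lacks.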

Why this matters

Inference systems are becoming shared infrastructure. That is good for utilization and bad for fairness unless the scheduler earns its keep. The early result is not a claim that KVWarden is finished. It is a narrow result with numbers attached, and it gives the next gate something real to beat.

Fairness is not a slogan. It is a latency distribution with names attached.
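
One way to make "a latency distribution with names attached" concrete is to report percentiles per tenant instead of one pooled number. The sketch below assumes a hypothetical `Sample` shape and uses the nearest-rank percentile method; the sample values are invented for illustration and are not Gate 2 measurements.

```typescript
type Sample = { tenantId: string; ttftMs: number };

// Nearest-rank percentile over an unsorted list, for p in (0, 100].
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new RangeError("no samples");
  const sorted = values.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

// Group samples by tenant and report p50 and p99 for each one.
function summarize(samples: Sample[]): Record<string, { p50: number; p99: number }> {
  const byTenant: Record<string, number[]> = {};
  for (const s of samples) {
    if (!(s.tenantId in byTenant)) byTenant[s.tenantId] = [];
    byTenant[s.tenantId].push(s.ttftMs);
  }
  const out: Record<string, { p50: number; p99: number }> = {};
  for (const tenantId of Object.keys(byTenant)) {
    const values = byTenant[tenantId];
    out[tenantId] = { p50: percentile(values, 50), p99: percentile(values, 99) };
  }
  return out;
}

// Made-up samples for illustration only.
const summary = summarize([
  { tenantId: "quiet", ttftMs: 50 },
  { tenantId: "quiet", ttftMs: 60 },
  { tenantId: "quiet", ttftMs: 70 },
  { tenantId: "flooder", ttftMs: 200 },
]);
console.log(summary["quiet"]); // { p50: 60, p99: 70 }
```

Keeping the grouping key in the output means the quiet tenant never disappears into an aggregate.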

Next

Re-run Gate 2 across longer traces.
Publish the harness.
Compare against additional cache-aware baselines.
Keep the quiet tenant visible in every chart.
