
What changed when we split the pools
We ran a two-week proof of concept. I split the cluster into two pools: eight GPUs dedicated to prompt processing, and the remaining GPUs handling token generation. No new hardware, no new cluster, just a configuration change in the serving layer and a routing policy that sent each request to the right pool based on its inference phase. The prompt-processing pool consistently hit 90–95% compute utilization because that’s all it did. No token generation competing for scheduling slots. No decode requests sitting idle while a prefill burst hogged the cores.
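A minimal sketch of what that routing amounts to, assuming a thin HTTP proxy in front of the two pools. The pool hostnames, the /prefill and /generate endpoints, and the kv_cache_id handoff are all illustrative, not the actual serving-layer API we used:

```python
import requests  # assumed: a thin HTTP proxy sitting in front of the two pools

PREFILL_POOL = "http://prefill-pool.internal:8000"  # 8 GPUs, prompt processing only
DECODE_POOL = "http://decode-pool.internal:8000"    # remaining GPUs, token generation

def serve(prompt: str, max_new_tokens: int) -> str:
    # Phase 1: prefill. The prompt goes to the dedicated prefill pool, which runs
    # the compute-bound pass and returns a handle to the KV cache it built.
    prefill = requests.post(
        f"{PREFILL_POOL}/prefill", json={"prompt": prompt}
    ).json()

    # Phase 2: decode. The decode pool picks up the KV cache by handle and folds
    # this request into its large batch of concurrently generating sequences.
    decode = requests.post(
        f"{DECODE_POOL}/generate",
        json={"kv_cache_id": prefill["kv_cache_id"], "max_new_tokens": max_new_tokens},
    ).json()
    return decode["text"]
```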
The token-generation pool was the bigger surprise. By batching hundreds of concurrent decode requests together, each pass over the model weights in memory was amortized across far more work. Bandwidth utilization climbed above 70%, far better than the 30% we’d been seeing when decode requests were interleaved with prefill on the same GPU. Overall compute efficiency roughly doubled.
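A back-of-the-envelope way to see the amortization, with illustrative numbers (the model size and bandwidth figure are assumptions, not measurements from this cluster): every decode step has to stream the full weights from HBM once, so the tokens produced per step scale with the batch size until the GPU turns compute-bound.

```python
# Illustrative numbers only: neither the model size nor the bandwidth figure
# comes from the cluster described above.
WEIGHT_BYTES = 140e9    # e.g. a ~70B-parameter model held in fp16
HBM_BANDWIDTH = 3.3e12  # ~3.3 TB/s of HBM bandwidth per GPU

def decode_tokens_per_second(batch_size: int) -> float:
    """Each decode step streams the full weights once regardless of batch size,
    so token throughput scales with the batch (ignoring KV-cache traffic and
    stopping once the GPU becomes compute-bound)."""
    step_time = WEIGHT_BYTES / HBM_BANDWIDTH  # seconds to read the weights once
    return batch_size / step_time

print(decode_tokens_per_second(1))    # ~24 tokens/s: one token per full weight read
print(decode_tokens_per_second(256))  # ~6,000 tokens/s: same weight reads, 256x the work
```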
The cost math followed. The customer was spending about $2M annually on inference GPU-hours. After disaggregation they were on track to cut that by $600–800K while serving the same request volume at the same latency targets. No new hardware purchased. Same GPUs, same cluster, same model weights — different architecture.
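Spelled out, the savings the team was on track for work out to a 30–40% reduction in GPU-hour spend:

```python
annual_spend = 2_000_000                       # ~$2M/year on inference GPU-hours
savings_low, savings_high = 600_000, 800_000   # projected annual savings range

print(f"{savings_low / annual_spend:.0%} to {savings_high / annual_spend:.0%} reduction")
# 30% to 40% reduction, on the same GPUs, same cluster, same model weights.
```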

