
This simple de-duplication reduced our storage requirement by more than 40%. In distributed systems like the ones powering Instagram or YouTube, storage isn’t passive; it requires CPU to manage, compress and replicate. By slashing the storage footprint, we improved bandwidth availability for the distributed workers fetching data for training, creating a virtuous cycle of efficiency throughout the stack.
Auditing the feature usage
The final piece of the puzzle was spring cleaning. In a system as old and complex as a major social network’s recommendation engine, digital hoarding is a real problem. We had over 100,000 distinct features registered in our system.
However, not all features are created equal. A user’s “age” might carry very little weight in the model compared to “recently liked content.” Yet, both cost resources to compute, fetch and log.

