
This shift marks a major departure from the traditional model of the earlier internet, in which each company ran its own systems and failures stayed contained. Today, when an LLM or its cloud host encounters issues, the impact ripples in real time across dozens, sometimes hundreds, of dependent businesses. This was clearly demonstrated in 2025, when both a key LLM provider and its cloud infrastructure faced outages. For nearly seven hours, LLM-powered applications, from legal AI tools to customer service chatbots to supply chain decision systems, were inoperative. The financial losses were significant and tangible: billions in lost revenue, plus steep costs for emergency fixes.
Outages will become more frequent
It is tempting to dismiss large-scale cloud or LLM failures as rare black-swan events that won’t recur for years. That is wishful thinking. By concentrating the computational power behind enterprise applications in a handful of hyperscale providers, we have built centralized points of failure into our most vital business systems. The convenience and cost-efficiency of third-party LLMs mask a fragile reality: As more organizations rely on these shared services for their data, reasoning, and customer engagement, each provider becomes a bigger target for operational issues, cyberattacks, misconfigurations, and software bugs.
Furthermore, demand for LLM services is growing rapidly, pushing current infrastructure to its limits and raising the risk of overload. Providers, meanwhile, are evolving just as quickly, layering new models and capabilities onto complex legacy cloud systems. Together, these pressures create unstable ground beneath what many executives assume is a “set-and-forget” solution.

