Scale becomes manageable when GenAI is treated as a service with explicit constraints and measurable outcomes. A small set of production disciplines gets you there.

Define the production contract
Write the contract for the experience you plan to operate. Put numbers on it. Include p95 latency, availability, error budget, and expected behavior under load. Add a cost envelope per request. Capture policy requirements for data access, citation, and tool use.
This step changes design choices quickly. A team with a 2.5-second p95 target makes different retrieval and routing choices than a team that can tolerate 10 seconds. A team budgeting three cents per answer chooses different model tiers than a team budgeting fifty cents.
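One way to make the contract concrete is to write it as data and check measured traffic against it. The sketch below is a minimal illustration, not a prescribed implementation: the class name, field names, and the nearest-rank p95 method are all assumptions, and the thresholds reuse the example numbers above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProductionContract:
    """Hypothetical contract; field names are illustrative assumptions."""
    p95_latency_s: float             # e.g. 2.5 seconds
    availability: float              # e.g. 0.995
    max_cost_per_request_usd: float  # e.g. 0.03 (three cents)

    @property
    def error_budget(self) -> float:
        # Fraction of requests allowed to fail under the availability target.
        return 1.0 - self.availability


def p95(samples: list[float]) -> float:
    """Nearest-rank 95th percentile (one of several common definitions)."""
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def check(contract: ProductionContract,
          latencies_s: list[float],
          costs_usd: list[float],
          failures: int) -> list[str]:
    """Return the names of contract terms the measured window violates."""
    violations = []
    if p95(latencies_s) > contract.p95_latency_s:
        violations.append("p95_latency")
    if failures / len(latencies_s) > contract.error_budget:
        violations.append("error_budget")
    if sum(costs_usd) / len(costs_usd) > contract.max_cost_per_request_usd:
        violations.append("cost_per_request")
    return violations


contract = ProductionContract(p95_latency_s=2.5,
                              availability=0.995,
                              max_cost_per_request_usd=0.03)
slow_window = check(contract,
                    latencies_s=[1.0] * 18 + [3.0, 9.0],
                    costs_usd=[0.02] * 20,
                    failures=0)
print(slow_window)
```

A check like this can run against a rolling window of request logs, so a breach of the latency, error, or cost term shows up as a named violation instead of a vague sense that the assistant feels slow or expensive.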
Treat retrieval as the main system
Most enterprise assistants rely on retrieval-augmented generation. Retrieval drives answer quality. Retrieval also drives unit economics through context size, re-ranking, and repeat work. I spend more time on retrieval quality than on prompt wording.
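To see how context size feeds unit economics, it helps to estimate the cost of one answer as a function of how many retrieved chunks go into the prompt. The sketch below is a rough model under stated assumptions: the token prices, chunk size, and prompt overhead are hypothetical placeholders, not any vendor's actual rates, and it ignores re-ranking and caching costs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Hypothetical per-request parameters; every value is an assumption."""
    prompt_overhead_tokens: int = 500   # system prompt plus user question
    tokens_per_chunk: int = 450         # average retrieved chunk size
    output_tokens: int = 300            # average answer length
    input_usd_per_1k: float = 0.003     # placeholder input token price
    output_usd_per_1k: float = 0.015    # placeholder output token price


def answer_cost_usd(chunks: int, pricing: Pricing) -> float:
    """Estimated model cost of one answer with `chunks` retrieved passages."""
    input_tokens = pricing.prompt_overhead_tokens + chunks * pricing.tokens_per_chunk
    return (input_tokens / 1000 * pricing.input_usd_per_1k
            + pricing.output_tokens / 1000 * pricing.output_usd_per_1k)


def max_chunks_within_budget(budget_usd: float, pricing: Pricing, cap: int = 50) -> int:
    """Largest chunk count (up to `cap`) whose estimated cost fits the budget."""
    if answer_cost_usd(0, pricing) > budget_usd:
        raise ValueError("budget is below the cost of an answer with no context")
    best = 0
    for k in range(1, cap + 1):
        if answer_cost_usd(k, pricing) > budget_usd:
            break
        best = k
    return best


pricing = Pricing()
print(max_chunks_within_budget(0.03, pricing))  # three-cent budget
print(max_chunks_within_budget(0.50, pricing))  # fifty-cent budget, hits the cap
```

Under these placeholder numbers the three-cent budget limits how much context each answer can carry, while the fifty-cent budget is bounded only by the cap. That is why retrieval precision, more than prompt wording, decides whether a tight cost envelope is workable.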

