
The novel power of today’s AI is in its ability to deal with intent. This is a superpower, no doubt, but it creates a huge imperative for app developers: the need to map between the anything-is-possible large language model (LLM) and the strict capabilities of code.
Unrestrained, LLM endpoints will let your user create unicorns and leprechauns while your back end can handle only purchase orders and customer profiles. You must tie the LLM’s ability to understand intent to what the app is logically capable of, while keeping context (and therefore spend) under control. Here I’ll discuss some practical, realistic techniques for doing that today.
Between what the user wants to do and what your app is capable of is you. Or, more specifically, the mediation layer you build. This layer can sit anywhere on a broad spectrum, from using incredibly lightweight inline strings to using a massive retrieval-augmented generation (RAG) system backed by a vector database. Somewhere in there is the sweet spot for your particular project.
It turns out there is a great deal you can do without resorting to the extra infrastructure of a vector database, and indeed, one should avoid that until it is really, truly needed. The first step in keeping your AI APIs manageable is the response schema.
Response schemas
Probably the single most potent weapon in your arsenal, the essential first move, is forcing the responses from the AI model into a well-defined structure, often JSON.
Not long ago, this was a hit-or-miss affair. The developer essentially begged the model for a structured response by adding “Respond with structured JSON like this: { "name": "string" }” to the prompt. And this would kind of work, but sometimes the AI would add a helpful “Here is your JSON: ” and the response handler would break.
Recent models are much better about this. They have been trained to pick up on these response-structure instructions, so they are much more reliable even when you issue a complex prompt that ends with a response schema requirement.
However, there is an even more rigorous way to enforce a structured response with newer models, which is to set the response MIME type in the request. For example, in Gemini:
responseMimeType: "application/json"
Different providers use slightly different names (for example, OpenAI’s API uses response_format). The beauty of this approach is that it is completely domain-agnostic. In an enterprise context, a user might prompt the model with a messy, fragmented request like, “I need to restock those red ergonomic chairs for the Austin office, grab me a dozen.” With structured output, the AI parses that intent into a clean {"sku": "CHR-RED-ERG", "quantity": 12, "location_id": "TX-AUS"}. The MIME type setting tells the model to return JSON rather than free-form prose.
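To make the setting concrete, here is a minimal sketch of a request body that carries it. The field names follow the Gemini REST API as I understand it; buildJsonRequest is a hypothetical helper of my own, not part of any SDK, and the sketch stops short of the network call.

```typescript
// Shape of a Gemini generateContent request body (REST field names).
interface GenerateContentRequest {
  contents: { role: "user"; parts: { text: string }[] }[];
  generationConfig: { responseMimeType: string };
}

// Hypothetical helper: wraps the user's text and asks for a JSON response.
function buildJsonRequest(userText: string): GenerateContentRequest {
  return {
    contents: [{ role: "user", parts: [{ text: userText }] }],
    generationConfig: {
      // Asks the model to emit JSON rather than free-form prose.
      responseMimeType: "application/json",
    },
  };
}

const requestBody = buildJsonRequest(
  "I need to restock those red ergonomic chairs for the Austin office, grab me a dozen."
);

// This body would be POSTed to the model's generateContent endpoint.
console.log(JSON.stringify(requestBody, null, 2));
```

Other providers accept an equivalent setting under a different field name, so it can help to keep this in one small builder function rather than scattering it through the app.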
Along with passing a MIME type, you can pass a JSON schema to the LLM. The schema constrains the model’s response to the shape and keys you define. In TypeScript projects, a schema validation library like Zod is often used to define the schema and validate the response. For example, for our chairs, you might use Zod to define:
import { z } from "zod";

// Expected shape of the structured order the model should return.
const orderZodSchema = z.object({
  sku: z.string(),
  quantity: z.number().int(), // whole units only
  location_id: z.string()
});
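On the receiving side, the response handler still needs to check what actually came back. The sketch below is a dependency-free stand-in for that check: it mirrors the three fields of the schema above with a hand-written type guard. With Zod installed, calling safeParse on the schema would do the equivalent job. The names isOrder and parseOrder are my own.

```typescript
interface Order {
  sku: string;
  quantity: number;
  location_id: string;
}

// Type guard mirroring the schema: two strings and an integer quantity.
function isOrder(value: unknown): value is Order {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v.sku === "string" &&
    typeof v.quantity === "number" &&
    Number.isInteger(v.quantity) &&
    typeof v.location_id === "string"
  );
}

// Returns a validated Order, or null when the text is not valid JSON
// or does not match the expected shape.
function parseOrder(rawModelText: string): Order | null {
  try {
    const parsed: unknown = JSON.parse(rawModelText);
    return isOrder(parsed) ? parsed : null;
  } catch {
    return null;
  }
}

const goodOrder = parseOrder(
  '{"sku":"CHR-RED-ERG","quantity":12,"location_id":"TX-AUS"}'
);
// The old failure mode: a chatty preamble makes the text invalid JSON.
const chattyOrder = parseOrder('Here is your JSON: {"sku":"CHR-RED-ERG"}');
// A fractional quantity parses as JSON but fails the integer check.
const fractionalOrder = parseOrder(
  '{"sku":"CHR-RED-ERG","quantity":1.5,"location_id":"TX-AUS"}'
);

console.log(goodOrder, chattyOrder, fractionalOrder);
```

Returning null instead of throwing lets the caller decide whether to retry the model call or fall back to asking the user.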
I’ve been relying heavily on this exact pattern while architecting the engine for an open-source, AI-mediated MUD called Terra Agnostum. In a text-based RPG, players input highly descriptive, unpredictable actions. If a user types, “I cautiously inspect the humming control panel for sabotage,” the LLM understands the narrative intent. But to actually resolve the action, the back end needs a JSON payload containing the exact mechanical requirements—identifying the target object, selecting the correct engine skill check, and mapping the action to the player’s core stats (like their AWARE or WILL attributes).
Function calling
If response schemas are the gateway to manageable LLM services, the next level up is function calling (sometimes called “tool use”). The basic idea here is to take your specific, in-app functions (the “tools”) and hand them to the LLM as part of the prompt request. Passing the model your functions gives it exact knowledge of what tools are available, i.e. what it can do code-wise in the context of the application. The model then can return an actual function call with arguments based on the intent.
To be clear, the LLM does not actually call the function. What happens is, you tell the LLM what functions are available (along with their signatures), and the LLM selects one, replies with what it wants to call, and then your code makes the actual call.
Let’s take an example from a standard enterprise CRM or ERP system. A sales director opens the command bar and types, “Pull the Q3 revenue numbers for the EMEA region, compare them to Q2, and email a summary to the regional VP.”
If you pass that to a standard generative endpoint, it will likely give a polite response about how it doesn’t have access to your database, or worse, hallucinate the numbers. But with function calling, you can pass the prompt along with a JSON array defining the signatures of two of your back-end tools: get_revenue_data(region, quarter) and send_email(recipient_role, data).
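As a rough illustration, that JSON array might look like the following. This is a sketch assuming the JSON Schema style of parameter description that the major providers use; each provider wraps these declarations in its own envelope, and the descriptions and the ToolDeclaration type are my own wording.

```typescript
// JSON Schema style description of one callable tool.
interface ToolDeclaration {
  name: string;
  description: string;
  parameters: {
    type: "object";
    properties: Record<
      string,
      { type: string; description: string; items?: { type: string } }
    >;
    required: string[];
  };
}

const toolDeclarations: ToolDeclaration[] = [
  {
    name: "get_revenue_data",
    description: "Return revenue totals for a region and one or more quarters.",
    parameters: {
      type: "object",
      properties: {
        region: { type: "string", description: "Sales region, e.g. EMEA" },
        quarter: {
          type: "array",
          description: "Quarters to fetch, e.g. Q2",
          items: { type: "string" },
        },
      },
      required: ["region", "quarter"],
    },
  },
  {
    name: "send_email",
    description: "Email a block of text to whoever holds the given role.",
    parameters: {
      type: "object",
      properties: {
        recipient_role: { type: "string", description: "Role of the recipient" },
        data: { type: "string", description: "Body text of the email" },
      },
      required: ["recipient_role", "data"],
    },
  },
];

// Serialized, this array travels alongside the prompt in the request.
console.log(JSON.stringify(toolDeclarations, null, 2));
```

The descriptions matter more than they look: they are the only documentation the model sees when deciding which function fits the user's intent.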
The AI reads the intent and immediately recognizes it shouldn’t generate text to the user. Instead, it pauses and returns a structured payload to your application: {"function": "get_revenue_data", "arguments": {"region": "EMEA", "quarter": ["Q2", "Q3"]}}.
Your deterministic code takes over. It runs the SQL query, grabs the numbers, and hands that raw data back to the LLM in the context window. The LLM processes the data, writes the summary, and then outputs a second function call: {"function": "send_email", "arguments": {"recipient_role": "Regional VP", "data": "[Summary Text]"}}.
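The round trip just described can be sketched as a small loop. Everything here is illustrative: fakeModel is a scripted stand-in for the LLM that replays the two calls from the walkthrough, the tool bodies are stubs, and the revenue figures are made-up placeholders.

```typescript
// Payload the model returns when it wants a tool run.
interface FunctionCall {
  function: string;
  arguments: Record<string, unknown>;
}

type ModelTurn =
  | { kind: "call"; call: FunctionCall }
  | { kind: "text"; text: string };

const sentEmails: string[] = [];

// Stub tools; real ones would run SQL and call a mail service.
const tools = new Map<string, (args: Record<string, unknown>) => string>([
  [
    "get_revenue_data",
    (args) => JSON.stringify({ region: args.region, Q2: 100, Q3: 120 }),
  ],
  [
    "send_email",
    (args) => {
      sentEmails.push(`${String(args.recipient_role)}: ${String(args.data)}`);
      return "sent";
    },
  ],
]);

// Your code, not the model, makes the actual call.
function dispatch(call: FunctionCall): string {
  const tool = tools.get(call.function);
  if (tool === undefined) throw new Error(`Unknown function: ${call.function}`);
  return tool(call.arguments);
}

// Scripted stand-in for the LLM, driven by how many tool results it has seen.
function fakeModel(toolResults: string[]): ModelTurn {
  if (toolResults.length === 0) {
    return {
      kind: "call",
      call: {
        function: "get_revenue_data",
        arguments: { region: "EMEA", quarter: ["Q2", "Q3"] },
      },
    };
  }
  if (toolResults.length === 1) {
    return {
      kind: "call",
      call: {
        function: "send_email",
        arguments: {
          recipient_role: "Regional VP",
          data: `Summary of ${toolResults[0]}`,
        },
      },
    };
  }
  return { kind: "text", text: "Summary emailed to the Regional VP." };
}

// Alternate between model turns and tool execution until the model answers.
function runLoop(maxSteps: number): string {
  const results: string[] = [];
  for (let step = 0; step < maxSteps; step++) {
    const turn = fakeModel(results);
    if (turn.kind === "text") return turn.text;
    results.push(dispatch(turn.call));
  }
  throw new Error("Too many tool calls");
}

const finalText = runLoop(5);
console.log(finalText, sentEmails);
```

The step cap is a cheap safeguard: a model that keeps requesting tools would otherwise keep the loop, and the bill, running.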
As with the response MIME type and schema definition, different providers use different formats to declare functions and return function calls. The OpenAI and Gemini API documentation each include function-calling examples.
The ‘back end’ is wherever your state lives
It is worth noting that the LLM doesn’t care where your code executes. When an LLM makes a function call, it isn’t executing code; it is merely returning a structured JSON payload. It is up to your app, wherever it may be running, to parse the JSON and execute the function.
In a traditional architecture, your server receives that JSON and runs a database query. But in a modern single-page application (SPA) or a serverless architecture, the response might be captured by a client-side JavaScript function running directly in the user’s browser, or it might be handled by a serverless function (e.g. Vercel Functions).
In the Terra Agnostum engine, the LLM frequently returns function calls that never touch a server. Instead, they trigger client-side JavaScript to immediately update the player’s local state manager, inject a new item into their UI inventory, or trigger a CSS visual update. Whether your back end is a big Java monolith talking to a SQL database or a lightweight client-side state manager, the orchestration pattern is identical: the LLM handles the intent, and your rigid code handles the execution.
Passing function calls back to a client-side environment (like a React or vanilla JS app) is fantastic for rapid prototyping and local state management. However, remember the golden rule of application security: never trust the client.
If your LLM returns a payload instructing the browser to grant_admin_privileges() or add_funds_to_account(), a malicious user could simply open their browser console and execute that local JavaScript function themselves, bypassing your application. For anything involving sensitive data, financial transactions, or shared-world state, the LLM must return its function calls to a secure, server-side environment where the execution cannot be spoofed.
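One way to apply that rule is a server-side gate that checks every function call against the caller's session before anything runs. The sketch below assumes a hypothetical permission table and session shape; update_inventory_ui is an invented name, and the role comes from the server's own session record, never from the model's payload or the browser.

```typescript
interface FunctionCall {
  function: string;
  arguments: Record<string, unknown>;
}

type Role = "player" | "admin";

// Session data the server already trusts (e.g. from its own auth layer).
interface Session {
  userId: string;
  role: Role;
}

// Hypothetical permission table: which roles may trigger which functions.
const permissions = new Map<string, Role[]>([
  ["update_inventory_ui", ["player", "admin"]],
  ["add_funds_to_account", ["admin"]],
  ["grant_admin_privileges", ["admin"]],
]);

// Runs on the server. Anything not explicitly listed is refused.
function executeOnServer(call: FunctionCall, session: Session): string {
  const roles = permissions.get(call.function);
  if (roles === undefined) return "rejected: unknown function";
  if (roles.indexOf(session.role) === -1) return "rejected: not permitted";
  // A real implementation would invoke the function here.
  return `executed: ${call.function}`;
}

const playerSession: Session = { userId: "u1", role: "player" };

console.log(
  executeOnServer(
    { function: "add_funds_to_account", arguments: { amount: 100 } },
    playerSession
  )
);
```

The table is deny-by-default, so a function the model invents, or one a user types into the console, gets the same refusal as one the caller is not allowed to use.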
Prompt routing
When working with LLM services, it pays to think about how requests are routed, and especially to find the cases where the service can be avoided entirely. The reality is that calls to LLM endpoints mean lag. You are making a network request, as with any service call, but model inference tends to take far longer than a conventional API call. It’s just the nature of the work.
Anytime you can avoid that call in the first place, you are winning. Not only are you avoiding latency, you are dodging the other big bugbear in AI: spend. You start to really see dollar signs in place of tokens when you work a lot with an LLM service.
Returning to the Terra Agnostum game engine, this hybrid approach is essential for a playable experience. If a player issues a move-north command and the room to the north already exists in the local map cache, the application’s deterministic router intercepts the command and simply updates the player’s coordinates in the database. The room description loads instantly.
In enterprise software settings, such as a purchase system, these kinds of hard-coded happy paths already exist and can be directly relied on when the fuzzy intent layer is not required.
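A deterministic router of this kind can be very small. The sketch below is not the Terra Agnostum code; the cache contents, coordinate scheme, and names are assumptions made for illustration. Known movement commands that land on a cached room are answered locally, and everything else falls through to the LLM.

```typescript
type Route =
  | { handledBy: "local"; description: string }
  | { handledBy: "llm"; prompt: string };

// Hypothetical local map cache, keyed by "x,y".
const roomCache = new Map<string, string>([
  ["0,1", "A dim corridor lined with humming conduits."],
]);

// Movement commands the router understands without any model call.
const moves = new Map<string, [number, number]>([
  ["north", [0, 1]],
  ["south", [0, -1]],
  ["east", [1, 0]],
  ["west", [-1, 0]],
]);

function route(command: string, position: [number, number]): Route {
  const delta = moves.get(command.trim().toLowerCase());
  if (delta !== undefined) {
    const key = `${position[0] + delta[0]},${position[1] + delta[1]}`;
    const cached = roomCache.get(key);
    // Fast path: the destination is already known, so skip the LLM.
    if (cached !== undefined) return { handledBy: "local", description: cached };
  }
  // Slow path: free-form intent, or a room that has not been generated yet.
  return { handledBy: "llm", prompt: command };
}

console.log(route("North", [0, 0]));
console.log(route("I cautiously inspect the humming control panel", [0, 0]));
```

The same shape works for a purchase system: match the commands you already have code paths for, and send only the remainder to the model.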
MCP vs. the capability layer
Model Context Protocol (MCP), an open standard originally introduced by Anthropic, has become widely used by developers to connect LLMs with all manner of tools, services, and data sources. Which raises the question: When should you adopt a standardized protocol like MCP, and when should you just build an internal capability layer into your application?
The answer maps somewhat to an old architectural debate: service-oriented architecture (SOA) versus the classic model-view-controller (MVC) pattern.
MCP is designed for decoupled, dynamic discovery. In this architecture, the AI agent is the “customer” being catered to. It reaches out across boundaries to discover what tools, databases, or APIs are available to it, much like an enterprise service bus in an SOA implementation. If you are building a universal AI assistant that needs to autonomously query Jira, pull from GitHub, and cross-reference Slack, MCP is the right choice.
Conversely, building an internal capability layer, as we did in the Terra Agnostum engine, is app-centric. It is tightly coupled. The application itself is the boss, and the LLM is just a translation microservice acting as a controller. The LLM doesn’t discover anything; the application explicitly hands it a strict, state-dependent menu of functions and says, “Translate the user’s fuzzy intent into one of these specific actions.”
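A state-dependent menu can be as simple as a lookup table. The following sketch uses a procurement example with invented state and function names; the point is only that the application picks the menu and refuses anything off it.

```typescript
interface ToolSpec {
  name: string;
  description: string;
}

type AppState = "browsing_catalog" | "reviewing_cart";

// Hypothetical menus: each application state exposes only the actions
// that make sense in that state.
const menus: Record<AppState, ToolSpec[]> = {
  browsing_catalog: [
    { name: "search_catalog", description: "Find items by keyword." },
    { name: "add_to_cart", description: "Add an item and quantity to the cart." },
  ],
  reviewing_cart: [
    { name: "remove_from_cart", description: "Remove an item from the cart." },
    { name: "submit_purchase_order", description: "Submit the cart for approval." },
  ],
};

// The menu handed to the LLM along with the user's text.
function menuFor(state: AppState): ToolSpec[] {
  return menus[state];
}

// The application stays the boss: a call that is not on the current
// menu is rejected, whatever the model returned.
function isOnMenu(state: AppState, functionName: string): boolean {
  return menuFor(state).some((tool) => tool.name === functionName);
}

console.log(menuFor("browsing_catalog").map((t) => t.name));
console.log(isOnMenu("browsing_catalog", "submit_purchase_order"));
```

Because the menu shrinks to what the current state allows, the request carries fewer tool definitions and the model has fewer ways to pick something nonsensical.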
If you are building a stand-alone, purpose-built application (whether that’s a corporate procurement tool or an RPG), you rarely need a universal agent. You need a well-regulated mediator. Don’t over-engineer a dynamic, decoupled agent ecosystem when a tightly coupled capability layer will give you better security, lower latency, and strict execution control.
Context management
In the world of LLM services, context is the fuel, but it is also the primary driver of cost and latency. Every token you send to an LLM increases the processing time and the bill. The most common architectural mistake is “context sprawl,” sending the model everything including the kitchen sink in hopes that it will figure it out.
Instead, think of context as a hierarchy of complexity. You should move up to the next level only when the business requirements force you to do so.
- Level 1: The surgical string (state-driven routing). This is the leanest form of context. Based on the user’s current page or state (e.g., “The user is on the Billing tab”), you inject a short and specific instruction. It is low overhead and extremely fast.
- Level 2: Context pinning (persistent truths). This involves identifying a small, “static” set of truths—like a user’s role or a specific set of business rules—and pinning them to every request. In Terra Agnostum, this is used for “Lore Injection,” which ensures the LLM always knows the fundamental rules of the world without searching a database.
- Level 3: Zero-DB RAG (local archive). Before reaching for a heavy database, look at your existing filesystem. If your documentation or rules fit in a few markdown files, just read them into the context window as needed. This reuses your existing infrastructure, at medium token cost and low overhead.
- Level 4: Vector RAG (semantic engine). This is the heavyweight champion. It involves chunking data, generating mathematical embeddings, and using a vector database like Pinecone or Milvus. This is only necessary for massive oceans of unpredictable data.
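The first three levels can be combined in a single context builder. The sketch below is one possible arrangement under assumptions of my own: the hints, pinned rules, and document texts are invented, an in-memory map stands in for markdown files on disk, and the budget is counted in characters as a rough proxy for tokens.

```typescript
// Level 1: one short instruction per page or state.
const pageHints = new Map<string, string>([
  ["billing", "The user is on the Billing tab."],
]);

// Level 2: a small set of truths pinned to every request.
const pinnedFacts: string[] = [
  "User role: purchasing manager.",
  "Large orders require approval.",
];

// Level 3: stand-in for a few local markdown files.
const localDocs = new Map<string, string>([
  ["refunds.md", "Refunds are issued to the original payment method."],
  ["shipping.md", "Standard shipping takes five business days."],
]);

function buildContext(page: string, question: string, maxChars: number): string {
  const parts: string[] = [];

  const hint = pageHints.get(page);
  if (hint !== undefined) parts.push(hint);

  parts.push(...pinnedFacts);

  // Pull in a local document only when the question mentions its topic.
  const q = question.toLowerCase();
  localDocs.forEach((text, file) => {
    const topic = file.replace(".md", "");
    if (q.indexOf(topic) !== -1) parts.push(text);
  });

  // Keep whole parts, cheapest levels first, until the budget runs out.
  const kept: string[] = [];
  let used = 0;
  for (const part of parts) {
    if (used + part.length > maxChars) break;
    kept.push(part);
    used += part.length;
  }
  return kept.join("\n");
}

console.log(buildContext("billing", "How do refunds work?", 1000));
```

Ordering the parts from Level 1 upward means a tight budget trims the most expensive context first and the surgical string survives.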
We should also mention in passing the emerging “stateful” API space. There are tools like Google’s context caching that allow you to attach context more cheaply or more persistently.
Developer and mediator
The practical truth about using an LLM as a service is that if you aren’t careful, you will end up with “résumé-driven development” — i.e., an application that incorporates complex vector RAG because it looks good on a LinkedIn profile, even when a surgical string would have been faster and cheaper and… saner.
Probably the most challenging part of software development with AI (on all sides) is keeping a human mind wrapped around the intense amount of content an LLM can generate. As app developers, our job has shifted. We are no longer just writing the code that executes; we are the mediators between the code, the AI, and the users. We build the fast paths to save on latency, we define the function menus to ensure security, and we curate the context to protect the budget.
Tame your back end by being a minimalist. Every time you can avoid an AI call, or make that call shorter and more structured, you aren’t just saving money, you’re building a more responsive and reliable application.

