From Prototype to Production: Productionizing LLM Applications
A practical guide to hardening LLM integrations against the failure modes that kill AI features before they scale.

Moving an LLM application from prototype to production requires more than wrapping an API call. The gap between a demo that works in controlled conditions and a system that handles real-world traffic is where most AI projects fail.
The Fragility Problem
Anyone can build a prompt wrapper. Few can defend one against prompt injection, drift, and unexpected model outputs.
Input Sanitization
Your first line of defense is input validation. Don’t just pass user input directly to the LLM:
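// ❌ Don't do this: raw user input goes straight to the model
const response = await openai.chat.completions.create({
  model: 'gpt-4',
  messages: [{ role: 'user', content: userInput }],
});

// ✅ Do this: validate and clean the input first
// (sanitizeInput is your own validation helper, not part of the OpenAI SDK)
const sanitized = sanitizeInput(userInput);
const response = await openai.chat.completions.create({
  model: 'gpt-4',
  messages: [{ role: 'user', content: sanitized }],
});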
Create an allow-list of acceptable patterns and reject anything that doesn’t match. For applications handling sensitive domains, consider using dedicated libraries for detecting jailbreak attempts.
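As a rough illustration of the allow-list idea, here is a minimal sketch of what a `sanitizeInput` helper could look like. The length limit, the ASCII-only pattern, and the `InputRejectedError` class are all assumptions made for this example; a real application would tune them to its own domain and languages.

```typescript
// Hypothetical limits; tune these for your own application.
const MAX_INPUT_LENGTH = 2000;
// Allow-list: letters, digits, whitespace, and common punctuation only.
const ALLOWED_PATTERN = /^[A-Za-z0-9\s.,;:!?'"()\-]+$/;

class InputRejectedError extends Error {}

function sanitizeInput(raw: string): string {
  const trimmed = raw.trim();

  if (trimmed.length === 0) {
    throw new InputRejectedError('Input is empty');
  }
  if (trimmed.length > MAX_INPUT_LENGTH) {
    throw new InputRejectedError('Input exceeds maximum length');
  }
  // Reject anything outside the allow-list (control characters, markup, etc.).
  if (!ALLOWED_PATTERN.test(trimmed)) {
    throw new InputRejectedError('Input contains disallowed characters');
  }
  return trimmed;
}
```

Rejecting outright, rather than silently stripping characters, keeps the behavior predictable: the caller can show the user a clear error instead of sending a subtly altered prompt.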
Output Validation
LLMs are non-deterministic. Even with temperature set to 0, outputs can vary. Always validate:
- Schema compliance: Does the output match your expected JSON schema?
- Content safety: Use secondary models or rules-based systems to scan for harmful content
- Coherence checks: Does the response make sense in context?
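The schema check is the easiest of the three to automate. The sketch below assumes a hypothetical `SupportAnswer` shape and validates it by hand; in practice you might prefer a schema library, but the principle is the same: parse, check every field, and return a typed result rather than trusting the raw text.

```typescript
// Hypothetical response shape; replace with your feature's real schema.
interface SupportAnswer {
  answer: string;
  confidence: number; // expected range: 0 to 1
}

type ValidationResult =
  | { ok: true; value: SupportAnswer }
  | { ok: false; reason: string };

function validateModelOutput(rawText: string): ValidationResult {
  let parsed: unknown;
  try {
    parsed = JSON.parse(rawText);
  } catch (_err) {
    return { ok: false, reason: 'not valid JSON' };
  }
  if (typeof parsed !== 'object' || parsed === null) {
    return { ok: false, reason: 'not a JSON object' };
  }

  const candidate = parsed as Record<string, unknown>;
  const answer = candidate['answer'];
  const confidence = candidate['confidence'];

  if (typeof answer !== 'string' || answer.trim() === '') {
    return { ok: false, reason: 'missing or empty "answer"' };
  }
  if (typeof confidence !== 'number' || confidence < 0 || confidence > 1) {
    return { ok: false, reason: '"confidence" must be a number between 0 and 1' };
  }
  return { ok: true, value: { answer, confidence } };
}
```

A failed validation is a signal to retry, fall back to a canned response, or escalate, depending on how critical the feature is.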
The Scaling Bottleneck
AI features are useless if your database locks up, your frontend lags, or your token costs spiral out of control.
Token Economics
LLM costs scale with usage. A feature that costs £0.01 per request costs £1,000 per day at 100,000 requests per day, or roughly £30,000 per month.
Strategies to control costs:
- Caching: Cache responses to identical requests. Embeddings for the same model and input are effectively deterministic, so never compute the same embedding twice
- Model tiering: Use smaller, cheaper models for simple tasks; reserve GPT-4 for complex reasoning
- Request batching: Group requests when possible to reduce overhead
- Streaming responses: Show progress to users while tokens arrive. This does not lower cost by itself, but it improves perceived latency
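To make the caching point concrete, here is a minimal sketch of an in-memory embedding cache. The `EmbedFn` type and the `withEmbeddingCache` wrapper are names invented for this example; the wrapped function stands in for whatever embedding client you actually use.

```typescript
type Embedding = number[];
type EmbedFn = (text: string) => Promise<Embedding>;

// Wraps any embedding function with a cache keyed on model name + text.
function withEmbeddingCache(modelName: string, embed: EmbedFn): EmbedFn {
  // Store the promise itself, so concurrent calls for the same text
  // share a single upstream request.
  const cache = new Map<string, Promise<Embedding>>();

  return (text: string) => {
    const key = modelName + '\u0000' + text;
    const hit = cache.get(key);
    if (hit) return hit;

    const pending = embed(text).catch((error) => {
      cache.delete(key); // do not cache failures
      throw error;
    });
    cache.set(key, pending);
    return pending;
  };
}
```

The model name is part of the key because embeddings from different models are not interchangeable. An unbounded `Map` is only suitable for a sketch; a shared store with eviction is the more likely choice once you run multiple instances.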
Database Considerations
LLM applications often need to store:
- Conversation history
- Generated content
- User feedback
Plan your schema for the write patterns you’ll actually see. Append-only conversation logs need different indexing than random-access document stores.
The Observability Gap
If your AI fails silently in front of a client, how long until you notice?
Essential Metrics
Track these at minimum:
| Metric | Why It Matters |
|---|---|
| Token usage per request | Cost forecasting |
| Latency (p50, p95, p99) | User experience |
| Error rate by type | Reliability |
| Output quality scores | Feature health |
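If your metrics backend does not compute percentiles for you, they are straightforward to derive from raw samples. This sketch uses the nearest-rank method; the function names are made up for this example.

```typescript
// Nearest-rank percentile over a list of samples (p is between 0 and 100).
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) {
    throw new Error('Cannot compute a percentile of zero samples');
  }
  const sorted = samples.slice().sort((a, b) => a - b);
  // Multiply before dividing to avoid floating-point drift in the rank.
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

function summarizeLatency(latenciesMs: number[]) {
  return {
    p50: percentile(latenciesMs, 50),
    p95: percentile(latenciesMs, 95),
    p99: percentile(latenciesMs, 99),
  };
}
```

Tail percentiles matter more than averages here: one slow outlier in a small sample dominates p95 and p99 while barely moving the median.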
Distributed Tracing
Every LLM call should be part of a trace:
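import { SpanStatusCode } from '@opentelemetry/api';

// Assumes `telemetry` is an OpenTelemetry Tracer and `model` is your LLM client wrapper.
async function tracedGenerate(input: string) {
  const trace = telemetry.startSpan('llm.invoke');
  try {
    const result = await model.generate(input);
    trace.setAttribute('tokens.used', result.usage.total_tokens);
    return result;
  } catch (error) {
    // Mark the span as failed so errors show up in trace queries.
    trace.setStatus({ code: SpanStatusCode.ERROR });
    throw error;
  } finally {
    // Always end the span, on success or failure.
    trace.end();
  }
}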
Production Guardrails
Here’s a production-ready pattern for LLM invocation:
async function invokeLLM(ctx: InvocationContext) {
  // 1. Sanitize input
  const sanitized = sanitizeInput(ctx.input);
  // 2. Check cache (compute the key once and reuse it)
  const cacheKey = hash(sanitized);
  const cached = await cache.get(cacheKey);
  if (cached) return cached;
  // 3. Start telemetry
  const trace = telemetry.startSpan('llm.invoke');
  // 4. Invoke with circuit breaker
  try {
    const result = await circuitBreaker.fire(() =>
      model.generate(sanitized, { signal: ctx.abortSignal })
    );
    // 5. Validate output
    assertSafeOutput(result);
    // 6. Cache and return
    await cache.set(cacheKey, result, ctx.ttl);
    return result;
  } catch (error) {
    // Record the failure on the span before rethrowing
    trace.setStatus({ code: SpanStatusCode.ERROR });
    throw error;
  } finally {
    trace.end();
  }
}
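The `circuitBreaker` used above is not defined in this article. As a rough idea of what sits behind a call like `circuitBreaker.fire(...)`, here is a minimal sketch of a breaker with the same call shape. The class, its constructor parameters, and the thresholds are assumptions for illustration, not the API of any particular library.

```typescript
type BreakerState = 'closed' | 'open' | 'half-open';

class CircuitBreaker {
  private state: BreakerState = 'closed';
  private failures = 0;
  private openedAt = 0;

  constructor(
    private readonly failureThreshold: number,
    private readonly resetTimeoutMs: number,
    private readonly now: () => number = Date.now, // injectable clock for tests
  ) {}

  async fire<T>(action: () => Promise<T>): Promise<T> {
    if (this.state === 'open') {
      if (this.now() - this.openedAt < this.resetTimeoutMs) {
        // Fail fast instead of piling more load on a struggling dependency.
        throw new Error('Circuit open: skipping LLM call');
      }
      this.state = 'half-open'; // timeout elapsed: let a trial call through
    }
    try {
      const result = await action();
      this.state = 'closed';
      this.failures = 0;
      return result;
    } catch (error) {
      this.failures += 1;
      // A failed trial call, or too many consecutive failures, opens the circuit.
      if (this.state === 'half-open' || this.failures >= this.failureThreshold) {
        this.state = 'open';
        this.openedAt = this.now();
      }
      throw error;
    }
  }
}
```

This version counts consecutive failures and does not limit concurrent trial calls in the half-open state, which a production breaker usually would. When the circuit is open, the caller is the right place to serve a cached or degraded response.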
The Human Element
Even with perfect automation, have a human review process for:
- New failure modes discovered in production
- Edge cases that bypass filters
- User complaints and feedback
Build a feedback loop: when users report bad outputs, capture the full context for analysis.
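One way to make "full context" concrete is to define the record you want attached to every report. The field names below are hypothetical; the point is that a report should carry enough to replay the request, not just the user's complaint.

```typescript
// All field names are hypothetical; record whatever you need to replay a request.
interface InvocationRecord {
  traceId: string;       // links the report to the distributed trace
  model: string;
  promptVersion: string;
  input: string;
  output: string;
}

interface FeedbackReport extends InvocationRecord {
  userComment: string;
  reportedAt: string; // ISO 8601 timestamp
}

function buildFeedbackReport(
  invocation: InvocationRecord,
  userComment: string,
  reportedAt: Date = new Date(),
): FeedbackReport {
  return {
    ...invocation,
    userComment: userComment.trim(),
    reportedAt: reportedAt.toISOString(),
  };
}
```

Storing the trace ID alongside the input and output lets a reviewer jump from a complaint straight to the latency, token usage, and error details of that exact call.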
Summary
Productionizing LLM applications requires treating the LLM as an unreliable dependency — one that needs input validation, output checking, comprehensive monitoring, and graceful degradation patterns. The teams that succeed are the ones that assume failure and build systems that handle it.
Frequently Asked Questions
What are the most common failure modes in production LLM applications?
The three critical failure modes are: 1) Fragility - prompt injection, drift, and unexpected outputs; 2) Scale - database locks, frontend lag, and spiraling token costs; 3) Observability - silent failures that go unnoticed until users report them.
How do you prevent prompt injection attacks?
No single control prevents prompt injection, so implement defense in depth: input sanitization with allow-list validation, output scanning for harmful content, rate limiting per user, and human-in-the-loop for high-stakes operations.
What observability metrics matter most for LLM applications?
Track token usage and costs per request, latency percentiles (p50, p95, p99), error rates by error type, output quality scores, and user feedback signals.