AI Engineering

From Prototype to Production: Productionizing LLM Applications

A practical guide to hardening LLM integrations against the failure modes that kill AI features before they scale.

· Tom Eustace

Moving an LLM application from prototype to production requires more than wrapping an API call. The gap between a demo that works in controlled conditions and a system that handles real-world traffic is where most AI projects fail.

The Fragility Problem

Anyone can build a prompt wrapper. Far fewer teams defend it against prompt injection, drift in model behavior, and unexpected outputs.

Input Sanitization

Your first line of defense is input validation. Don’t just pass user input directly to the LLM:

// ❌ Don't do this: raw user input goes straight to the model
const response = await openai.chat.completions.create({
  model: 'gpt-4',
  messages: [{ role: 'user', content: userInput }],
});

// ✅ Do this: validate first (sanitizeInput is your own helper)
const sanitized = sanitizeInput(userInput);
const safeResponse = await openai.chat.completions.create({
  model: 'gpt-4',
  messages: [{ role: 'user', content: sanitized }],
});

Create an allow-list of acceptable patterns and reject anything that doesn’t match. For applications handling sensitive domains, consider using dedicated libraries for detecting jailbreak attempts.
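
A minimal sketch of what a `sanitizeInput` helper along these lines could look like. The length limit, the allowed character set, and the `InputRejectedError` name are illustrative assumptions rather than part of any library; a real allow-list depends on what your feature legitimately needs to accept.

```typescript
// Hypothetical limit: tune to what your feature actually needs.
const MAX_INPUT_LENGTH = 2000;

// Hypothetical allow-list: ASCII letters, digits, whitespace, basic punctuation.
// Deliberately narrow; widen it for other languages or symbols you must support.
const ALLOWED_INPUT = /^[A-Za-z0-9\s.,;:!?'"()\-]+$/;

class InputRejectedError extends Error {
  constructor(reason: string) {
    super(`input rejected: ${reason}`);
    this.name = 'InputRejectedError';
  }
}

function sanitizeInput(raw: string): string {
  // Fold Unicode compatibility forms (e.g. full-width letters) and trim whitespace.
  const normalized = raw.normalize('NFKC').trim();

  if (normalized.length === 0) {
    throw new InputRejectedError('empty input');
  }
  if (normalized.length > MAX_INPUT_LENGTH) {
    throw new InputRejectedError('input too long');
  }
  if (!ALLOWED_INPUT.test(normalized)) {
    throw new InputRejectedError('disallowed characters');
  }
  return normalized;
}
```

A check like this catches control characters and oversized payloads, but an injection written in ordinary words will still pass it, so treat it as one layer rather than the whole defense.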

Output Validation

LLMs are non-deterministic. Even with temperature set to 0, outputs can vary. Always validate:

  • Schema compliance: Does the output match your expected JSON schema?
  • Content safety: Use secondary models or rules-based systems to scan for harmful content
  • Coherence checks: Does the response make sense in context?
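
The schema-compliance check can be sketched without any library, as below. The `TicketSummary` shape and the `parseTicketSummary` name are hypothetical, chosen only to illustrate the idea: never trust that the model returned the JSON you asked for, and return a typed result the caller can branch on.

```typescript
// Hypothetical output shape we asked the model to produce.
type Category = 'billing' | 'technical' | 'other';

interface TicketSummary {
  category: Category;
  summary: string;
}

type ParseResult =
  | { ok: true; value: TicketSummary }
  | { ok: false; reason: string };

function isCategory(value: unknown): value is Category {
  return value === 'billing' || value === 'technical' || value === 'other';
}

function parseTicketSummary(raw: string): ParseResult {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch (e) {
    // Models often wrap JSON in prose; treat that as a validation failure.
    return { ok: false, reason: 'not valid JSON' };
  }
  if (typeof data !== 'object' || data === null || Array.isArray(data)) {
    return { ok: false, reason: 'expected a JSON object' };
  }
  const record = data as Record<string, unknown>;
  const category = record['category'];
  const summary = record['summary'];
  if (!isCategory(category)) {
    return { ok: false, reason: 'unknown category' };
  }
  if (typeof summary !== 'string' || summary.trim().length === 0) {
    return { ok: false, reason: 'missing summary' };
  }
  return { ok: true, value: { category, summary } };
}
```

On a failed parse, a caller might retry once with a stricter prompt or fall back to a default response instead of surfacing the raw output.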

The Scaling Bottleneck

AI features are useless if your database locks up, your frontend lags, or your token costs spiral out of control.

Token Economics

LLM costs scale with usage. A feature that costs £0.01 per request runs to £1,000 per day at 100,000 requests per day, or roughly £30,000 per month.
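
The arithmetic is worth keeping in a small helper so you can rerun it as prices or traffic change. This is only a sketch; the function names are made up for illustration, and amounts are held in pence so the money math stays in whole numbers.

```typescript
// Work in pence (integer minor units) to avoid floating-point rounding in money math.
function projectCostPence(
  costPerRequestPence: number,
  requestsPerDay: number,
  days: number,
): number {
  return costPerRequestPence * requestsPerDay * days;
}

function formatPounds(pence: number): string {
  return `£${(pence / 100).toFixed(2)}`;
}

// £0.01 per request (1 pence) at 100,000 requests per day.
const dailyPence = projectCostPence(1, 100000, 1);
const monthlyPence = projectCostPence(1, 100000, 30);

console.log(formatPounds(dailyPence));   // £1000.00
console.log(formatPounds(monthlyPence)); // £30000.00
```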

Strategies to control costs:

  1. Caching: Cache responses to identical requests. Embeddings for a given model and input are effectively deterministic, so never compute the same embedding twice
  2. Model tiering: Use smaller, cheaper models for simple tasks; reserve GPT-4 for complex reasoning
  3. Request batching: Group requests when possible to reduce overhead
  4. Streaming responses: Streaming does not reduce token spend, but it shows progress to users while tokens arrive, which lowers perceived latency
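
As an illustration of the caching point, here is a sketch of a wrapper that memoizes embedding calls. `EmbedFn` and `cacheEmbeddings` are hypothetical names, and the wrapped function stands in for whichever embedding client you use. The cache is an in-process `Map`; a shared store would be needed across multiple instances.

```typescript
// Hypothetical signature for an embedding call: model name plus text in, vector out.
type EmbedFn = (model: string, text: string) => Promise<number[]>;

function cacheEmbeddings(embed: EmbedFn): EmbedFn {
  // Store the promise itself so concurrent identical requests share one call.
  const cache = new Map<string, Promise<number[]>>();

  return (model, text) => {
    // The model is part of the key: different models give different vectors.
    const key = `${model}\u0000${text}`;
    let pending = cache.get(key);
    if (pending === undefined) {
      pending = embed(model, text);
      // Do not keep failed calls in the cache; let the next request retry.
      pending.catch(() => cache.delete(key));
      cache.set(key, pending);
    }
    return pending;
  };
}
```

Usage is a one-line wrap, for example `const embed = cacheEmbeddings(rawEmbed)`, after which repeated calls with the same model and text reuse the first result.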

Database Considerations

LLM applications often need to store:

  • Conversation history
  • Generated content
  • User feedback

Plan your schema for the write patterns you’ll actually see. Append-only conversation logs need different indexing than random-access document stores.
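
To make the append-only case concrete, here is an in-memory stand-in for a conversation log. The `MessageRow` fields and the `ConversationLog` class are assumptions for illustration only. In a real database the equivalent would typically be a composite key on conversation ID and sequence number, so that "latest N messages for this conversation" is a cheap range read.

```typescript
type Role = 'system' | 'user' | 'assistant';

interface MessageRow {
  conversationId: string;
  seq: number; // position within the conversation, starting at 1
  role: Role;
  content: string;
  createdAt: Date;
}

class ConversationLog {
  // Rows are grouped by conversation and only ever appended, never updated.
  private readonly byConversation = new Map<string, MessageRow[]>();

  append(conversationId: string, role: Role, content: string): MessageRow {
    const rows = this.byConversation.get(conversationId) ?? [];
    const row: MessageRow = {
      conversationId,
      seq: rows.length + 1,
      role,
      content,
      createdAt: new Date(),
    };
    rows.push(row);
    this.byConversation.set(conversationId, rows);
    return row;
  }

  // The dominant read: the most recent messages of one conversation, in order.
  recent(conversationId: string, limit: number): MessageRow[] {
    if (limit <= 0) return [];
    const rows = this.byConversation.get(conversationId) ?? [];
    return rows.slice(-limit);
  }
}
```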

The Observability Gap

If your AI fails silently in front of a client, how long until you notice?

Essential Metrics

Track these at minimum:

Metric                     Why It Matters
Token usage per request    Cost forecasting
Latency (p50, p95, p99)    User experience
Error rate by type         Reliability
Output quality scores      Feature health
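
Most metrics backends compute latency percentiles for you. For intuition, here is a sketch of the nearest-rank method over raw samples; the `percentile` helper and the sample values are made up for illustration. It shows why p95 and p99 matter: a single slow request barely moves the median but dominates the tail.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of
// samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) {
    throw new Error('no samples');
  }
  if (p <= 0 || p > 100) {
    throw new Error('p must be in (0, 100]');
  }
  // Sort a copy numerically; the default sort would compare as strings.
  const sorted = samples.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[rank - 1];
}

// Ten request latencies in milliseconds, one of them an outlier.
const latenciesMs = [120, 80, 200, 95, 1500, 110, 105, 90, 130, 100];

console.log(percentile(latenciesMs, 50)); // 105
console.log(percentile(latenciesMs, 95)); // 1500
console.log(percentile(latenciesMs, 99)); // 1500
```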

Distributed Tracing

Every LLM call should be part of a trace:

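import { SpanStatusCode } from '@opentelemetry/api';

// `telemetry` is assumed to wrap an OpenTelemetry tracer; `model` is your LLM client.
async function tracedGenerate(input: string) {
  const span = telemetry.startSpan('llm.invoke');
  try {
    const result = await model.generate(input);
    span.setAttribute('tokens.used', result.usage.total_tokens);
    return result;
  } catch (error) {
    // Record the failure on the span before rethrowing
    span.recordException(error as Error);
    span.setStatus({ code: SpanStatusCode.ERROR });
    throw error;
  } finally {
    span.end(); // always close the span, on success or failure
  }
}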

Production Guardrails

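Here is a pattern that ties the previous steps together in a single invocation path. Helpers such as sanitizeInput, cache, circuitBreaker, and assertSafeOutput are application-specific: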

async function invokeLLM(ctx: InvocationContext) {
  // 1. Sanitize input
  const sanitized = sanitizeInput(ctx.input);

  // 2. Check cache (compute the key once and reuse it)
  const cacheKey = hash(sanitized);
  const cached = await cache.get(cacheKey);
  if (cached) return cached;

  // 3. Start telemetry
  const span = telemetry.startSpan('llm.invoke');

  // 4. Invoke with circuit breaker
  try {
    const result = await circuitBreaker.fire(() =>
      model.generate(sanitized, { signal: ctx.abortSignal })
    );

    // 5. Validate output (throws on unsafe or malformed output)
    assertSafeOutput(result);

    // 6. Cache and return
    await cache.set(cacheKey, result, ctx.ttl);
    return result;
  } catch (error) {
    // Mark the span as failed so errors show up in traces
    span.setStatus({ code: SpanStatusCode.ERROR });
    throw error;
  } finally {
    span.end();
  }
}
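
The `circuitBreaker` used above is not defined in this article. In practice you would likely reach for an existing library, but a minimal sketch shows the idea: after a run of consecutive failures, stop calling the model for a cool-down period and fail fast instead. The class below, its thresholds, and the injectable clock are all assumptions for illustration. It does not limit concurrent trial calls in the half-open state.

```typescript
type BreakerState = 'closed' | 'open' | 'half-open';

class CircuitOpenError extends Error {
  constructor() {
    super('circuit breaker is open');
    this.name = 'CircuitOpenError';
  }
}

class CircuitBreaker {
  private readonly failureThreshold: number;
  private readonly resetAfterMs: number;
  private readonly now: () => number;
  private failures = 0;
  private openedAt = 0;
  private current: BreakerState = 'closed';

  constructor(
    failureThreshold: number,
    resetAfterMs: number,
    now: () => number = () => Date.now(), // injectable clock, handy in tests
  ) {
    this.failureThreshold = failureThreshold;
    this.resetAfterMs = resetAfterMs;
    this.now = now;
  }

  getState(): BreakerState {
    return this.current;
  }

  async fire<T>(action: () => Promise<T>): Promise<T> {
    if (this.current === 'open') {
      if (this.now() - this.openedAt < this.resetAfterMs) {
        // Still cooling down: fail fast without touching the upstream.
        throw new CircuitOpenError();
      }
      this.current = 'half-open'; // cool-down elapsed: allow a trial call
    }
    try {
      const result = await action();
      this.failures = 0;
      this.current = 'closed';
      return result;
    } catch (error) {
      this.failures += 1;
      // A failed trial call, or too many consecutive failures, opens the circuit.
      if (this.current === 'half-open' || this.failures >= this.failureThreshold) {
        this.current = 'open';
        this.openedAt = this.now();
      }
      throw error;
    }
  }
}
```

Callers can catch `CircuitOpenError` and fall back to a cached answer or a plain "try again shortly" message, which is one way to degrade gracefully while the provider is down.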

The Human Element

Even with perfect automation, have a human review process for:

  • New failure modes discovered in production
  • Edge cases that bypass filters
  • User complaints and feedback

Build a feedback loop: when users report bad outputs, capture the full context for analysis.

Summary

Productionizing LLM applications requires treating the LLM as an unreliable dependency — one that needs input validation, output checking, comprehensive monitoring, and graceful degradation patterns. The teams that succeed are the ones that assume failure and build systems that handle it.

Frequently Asked Questions

What are the most common failure modes in production LLM applications?

The three critical failure modes are: 1) Fragility - prompt injection, drift, and unexpected outputs; 2) Scale - database locks, frontend lag, and spiraling token costs; 3) Observability - silent failures that go unnoticed until users report them.

How do you prevent prompt injection attacks?

Implement defense in depth: input sanitization with allow-list validation, output scanning for harmful content, rate limiting per user, and human-in-the-loop for high-stakes operations.

What observability metrics matter most for LLM applications?

Track token usage and costs per request, latency percentiles (p50, p95, p99), error rates by error type, output quality scores, and user feedback signals.