From Prototype to Production: Productionizing LLM Applications

Moving an LLM application from prototype to production requires more than wrapping an API call. The gap between a demo that works in controlled conditions and a system that handles real-world traffic is where many AI projects stall.

The Fragility Problem

Anyone can build a prompt wrapper. Few can defend one against prompt injection, model drift, and unexpected inputs and outputs.

Input Sanitization

Your first line of defense is input validation. Don’t just pass user input directly to the LLM:

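// ❌ Don't do this
const response = await openai.chat.completions.create({
  model: 'gpt-4', // the SDK requires a model; use whichever one you deploy
  messages: [{ role: 'user', content: userInput }],
});

// ✅ Do this (sanitizeInput is your own helper, not part of the SDK)
const sanitized = sanitizeInput(userInput);
const safeResponse = await openai.chat.completions.create({
  model: 'gpt-4',
  messages: [{ role: 'user', content: sanitized }],
});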

Create an allow-list of acceptable patterns and reject anything that doesn’t match. For applications handling sensitive domains, consider using dedicated libraries for detecting jailbreak attempts.
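The sanitizeInput helper is left undefined in the snippet above. A minimal sketch of an allow-list check might look like the following. The length limit and the character set are assumptions chosen for illustration (they are deliberately strict and ASCII-only); a real application would tune both to its domain.

```typescript
// Hypothetical limit; tune for your own application.
const MAX_INPUT_LENGTH = 2000;

// Allow-list: ASCII letters, digits, spaces, newlines and basic punctuation.
// Anything else (control characters, zero-width characters, markup) is rejected.
const ALLOWED_INPUT = /^[A-Za-z0-9 .,;:!?'"()\n-]+$/;

function sanitizeInput(raw: string): string {
  const text = raw.trim();
  if (text.length === 0) {
    throw new Error('input rejected: empty');
  }
  if (text.length > MAX_INPUT_LENGTH) {
    throw new Error('input rejected: too long');
  }
  if (!ALLOWED_INPUT.test(text)) {
    throw new Error('input rejected: disallowed characters');
  }
  return text;
}
```

Rejecting outright, rather than silently stripping characters, keeps the behavior predictable: the user sees an error instead of a response to a prompt they did not write.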

Output Validation

LLMs are non-deterministic. Even with temperature set to 0, outputs can vary. Always validate:

Schema compliance: Does the output match your expected JSON schema?
Content safety: Use secondary models or rules-based systems to scan for harmful content
Coherence checks: Does the response make sense in context?
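As an illustration of the schema-compliance check, here is a hand-rolled validator with no dependencies. The TicketSummary shape is hypothetical; in practice you might prefer a schema library, but the principle is the same: parse, check every field, and treat anything unexpected as a failure.

```typescript
// Hypothetical shape that the prompt asks the model to return.
interface TicketSummary {
  title: string;
  priority: 'low' | 'medium' | 'high';
  tags: string[];
}

const PRIORITIES = ['low', 'medium', 'high'];

// Returns the parsed object, or null when the raw model output is not valid.
function parseTicketSummary(raw: string): TicketSummary | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null; // not JSON at all, e.g. the model added prose around it
  }
  if (typeof data !== 'object' || data === null) return null;
  const obj = data as Record<string, unknown>;
  if (typeof obj.title !== 'string' || obj.title.length === 0) return null;
  if (typeof obj.priority !== 'string' || PRIORITIES.indexOf(obj.priority) === -1) return null;
  if (!Array.isArray(obj.tags)) return null;
  if (!obj.tags.every((tag: unknown) => typeof tag === 'string')) return null;
  return {
    title: obj.title,
    priority: obj.priority as TicketSummary['priority'],
    tags: obj.tags as string[],
  };
}
```

A null result is the signal to retry, fall back to a default, or surface an error, rather than passing malformed data downstream.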

The Scaling Bottleneck

AI features are useless if your database locks up, your frontend lags, or your token costs spiral out of control.

Token Economics

LLM costs scale with usage. A feature that costs £0.01 per request adds up to £1,000 per day at 100,000 requests per day, or roughly £30,000 per month.
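A rough cost model makes this kind of arithmetic repeatable. The sketch below assumes per-1,000-token pricing split between input and output tokens; the prices in the usage note are invented for illustration, so substitute your provider's current rates.

```typescript
// Prices in GBP per 1,000 tokens. Values are supplied by the caller.
interface Pricing {
  inputPer1k: number;
  outputPer1k: number;
}

// Cost of a single request given its token counts.
function costPerRequest(pricing: Pricing, inputTokens: number, outputTokens: number): number {
  return (inputTokens / 1000) * pricing.inputPer1k + (outputTokens / 1000) * pricing.outputPer1k;
}

// Projected monthly spend at a steady request rate (30-day month by default).
function monthlyCost(perRequest: number, requestsPerDay: number, days = 30): number {
  return perRequest * requestsPerDay * days;
}
```

For example, with hypothetical prices of £0.01 per 1,000 input tokens and £0.03 per 1,000 output tokens, a request with 500 input and 200 output tokens costs about £0.011, and monthlyCost(0.01, 100000) gives £30,000.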

Strategies to control costs:

Caching: Cache identical requests. For a fixed model and input, an embedding is effectively the same every time, so never compute the same embedding twice
Model tiering: Use smaller, cheaper models for simple tasks; reserve GPT-4 for complex reasoning
Request batching: Group requests when possible to reduce overhead
Streaming responses: This does not reduce token spend, but it shows progress to users while tokens arrive, so latency feels shorter
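The caching strategy can be sketched as a small in-memory memoizer. This is an illustrative sketch, not a production cache: RequestCache and embeddingKey are hypothetical names, there is no eviction or TTL, and a shared store would be needed across multiple processes.

```typescript
// Caches in-flight and completed results by key, so concurrent identical
// requests share a single upstream call.
class RequestCache<T> {
  private entries = new Map<string, Promise<T>>();

  getOrCompute(key: string, compute: () => Promise<T>): Promise<T> {
    const hit = this.entries.get(key);
    if (hit) return hit;
    const pending = compute();
    this.entries.set(key, pending);
    // Do not keep failures around; the next caller should retry.
    pending.catch(() => {
      this.entries.delete(key);
    });
    return pending;
  }
}

// Key on everything that affects the result (here: model name and exact text).
function embeddingKey(model: string, text: string): string {
  return JSON.stringify([model, text]);
}
```

Storing the promise rather than the resolved value means two identical requests arriving at the same moment trigger one call instead of two.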

Database Considerations

LLM applications often need to store:

Conversation history
Generated content
User feedback

Plan your schema for the write patterns you’ll actually see. Append-only conversation logs need different indexing than random-access document stores.
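To make the append-only case concrete, here is an in-memory stand-in for a table keyed on (conversationId, seq). The field names are assumptions for illustration; the point is that rows are only ever appended, and the dominant read is "the last N messages of one conversation".

```typescript
// One immutable row per message; nothing is updated in place.
interface ConversationEvent {
  conversationId: string;
  seq: number; // position within the conversation, assigned on append
  role: 'user' | 'assistant' | 'system';
  content: string;
  createdAt: number; // epoch milliseconds
}

class ConversationLog {
  private byConversation = new Map<string, ConversationEvent[]>();

  append(
    conversationId: string,
    role: ConversationEvent['role'],
    content: string,
    now: number = Date.now()
  ): ConversationEvent {
    const events = this.byConversation.get(conversationId) ?? [];
    const event: ConversationEvent = { conversationId, seq: events.length, role, content, createdAt: now };
    events.push(event);
    this.byConversation.set(conversationId, events);
    return event;
  }

  // The most recent `limit` messages of one conversation, oldest first.
  recent(conversationId: string, limit: number): ConversationEvent[] {
    const events = this.byConversation.get(conversationId) ?? [];
    return events.slice(Math.max(0, events.length - limit));
  }
}
```

In a real database the same access pattern suggests a composite index on the conversation identifier and sequence number, so that reading recent history is a short range scan.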

The Observability Gap

If your AI fails silently in front of a client, how long until you notice?

Essential Metrics

Track these at minimum:

Metric	Why It Matters
Token usage per request	Cost forecasting
Latency (p50, p95, p99)	User experience
Error rate by type	Reliability
Output quality scores	Feature health
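Latency percentiles are easy to get subtly wrong, so here is a small sketch using the nearest-rank method. This is one of several common percentile definitions; monitoring tools may interpolate or use approximate histograms instead.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of
// all samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error('no samples');
  if (p <= 0 || p > 100) throw new Error('p must be in (0, 100]');
  const sorted = [...samples].sort((a, b) => a - b);
  // Multiply before dividing to avoid floating-point error in p / 100.
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[rank - 1];
}

function latencySummary(latenciesMs: number[]): { p50: number; p95: number; p99: number } {
  return {
    p50: percentile(latenciesMs, 50),
    p95: percentile(latenciesMs, 95),
    p99: percentile(latenciesMs, 99),
  };
}
```

For the ten samples [120, 80, 200, 95, 3000, 110, 105, 90, 100, 130], the p50 is 105 ms while p95 and p99 are both 3000 ms. The single slow request is invisible in the median and dominates the tail, which is why all three are worth tracking.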

Distributed Tracing

Every LLM call should be part of a trace:

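import { SpanStatusCode } from '@opentelemetry/api';

// Assumes `telemetry` is an OpenTelemetry Tracer and `model` is your LLM client.
async function tracedGenerate(input) {
  const trace = telemetry.startSpan('llm.invoke');
  try {
    const result = await model.generate(input);
    trace.setAttribute('tokens.used', result.usage.total_tokens);
    return result;
  } catch (error) {
    // Attach the error to the span so failures are visible in the trace
    trace.recordException(error);
    trace.setStatus({ code: SpanStatusCode.ERROR });
    throw error;
  } finally {
    trace.end();
  }
}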

Production Guardrails

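Here’s a pattern that combines these guardrails for LLM invocation. The helpers it calls (sanitizeInput, cache, hash, telemetry, circuitBreaker, assertSafeOutput) stand in for your own implementations: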

async function invokeLLM(ctx: InvocationContext) {
  // 1. Sanitize input
  const sanitized = sanitizeInput(ctx.input);

  // 2. Check cache (compute the key once and reuse it)
  const key = hash(sanitized);
  const cached = await cache.get(key);
  if (cached) return cached;

  // 3. Start telemetry
  const trace = telemetry.startSpan('llm.invoke');

  // 4. Invoke with circuit breaker
  try {
    const result = await circuitBreaker.fire(() =>
      model.generate(sanitized, { signal: ctx.abortSignal })
    );

    // 5. Validate output before it can reach the cache or the caller
    assertSafeOutput(result);

    // 6. Cache and return
    await cache.set(key, result, ctx.ttl);
    return result;
  } catch (error) {
    // Mark the span as failed so errors are not silently dropped
    trace.setStatus({ code: SpanStatusCode.ERROR });
    throw error;
  } finally {
    trace.end();
  }
}
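The circuitBreaker used above is not defined in this article. As a rough idea of what sits behind a call like circuitBreaker.fire, here is a simplified sketch. The class, its default threshold and its cooldown are assumptions for illustration, not any particular library's implementation, and it does not limit how many trial calls run concurrently while half-open.

```typescript
type BreakerState = 'closed' | 'open' | 'half-open';

class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;
  private state: BreakerState = 'closed';
  private readonly threshold: number;
  private readonly cooldownMs: number;

  constructor(threshold = 5, cooldownMs = 30000) {
    this.threshold = threshold;
    this.cooldownMs = cooldownMs;
  }

  // Whether a call may go through at time `now` (epoch milliseconds).
  canRequest(now: number = Date.now()): boolean {
    if (this.state === 'open' && now - this.openedAt >= this.cooldownMs) {
      this.state = 'half-open'; // cooldown elapsed: allow trial calls again
    }
    return this.state !== 'open';
  }

  recordSuccess(): void {
    this.failures = 0;
    this.state = 'closed';
  }

  recordFailure(now: number = Date.now()): void {
    this.failures += 1;
    // A failed trial call, or too many consecutive failures, opens the circuit.
    if (this.state === 'half-open' || this.failures >= this.threshold) {
      this.state = 'open';
      this.openedAt = now;
    }
  }

  async fire<T>(fn: () => Promise<T>): Promise<T> {
    if (!this.canRequest()) throw new Error('circuit open');
    try {
      const result = await fn();
      this.recordSuccess();
      return result;
    } catch (error) {
      this.recordFailure();
      throw error;
    }
  }
}
```

While the circuit is open, callers fail fast instead of queuing behind a struggling upstream, which is the point at which a fallback response or a cheaper model can take over.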

The Human Element

Even with perfect automation, have a human review process for:

New failure modes discovered in production
Edge cases that bypass filters
User complaints and feedback

Build a feedback loop: when users report bad outputs, capture the full context for analysis.
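What "full context" means will vary by application. As one possible sketch, a feedback report could bundle the user's rating with everything needed to reproduce the call. All field names here are hypothetical.

```typescript
// What was sent and received for one LLM call.
interface InvocationRecord {
  traceId: string; // links the report back to the distributed trace
  model: string;
  promptVersion: string;
  input: string;
  output: string;
}

interface FeedbackReport extends InvocationRecord {
  rating: 'good' | 'bad';
  comment: string;
  reportedAt: string; // ISO 8601 timestamp
}

function buildFeedbackReport(
  invocation: InvocationRecord,
  rating: 'good' | 'bad',
  comment: string,
  now: Date = new Date()
): FeedbackReport {
  return {
    ...invocation,
    rating,
    comment: comment.trim(),
    reportedAt: now.toISOString(),
  };
}
```

Recording the prompt version and model alongside the input makes it possible to tell whether a complaint is about current behavior or about something a later change already addressed.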

Summary

Productionizing LLM applications requires treating the LLM as an unreliable dependency — one that needs input validation, output checking, comprehensive monitoring, and graceful degradation patterns. The teams that succeed are the ones that assume failure and build systems that handle it.