The Production-Ready AI Problem Nobody Wants to Talk About

One pattern keeps surfacing in my reading over the past two weeks: the gap between AI prototypes that wow you once and production systems that have to work tomorrow.

It showed up in an article about reviewing six vibe-coded codebases, all sharing the same architectural sins. It appeared in a post-mortem on building production chatbots, where six months of iteration revealed that “just add a chatbot” hides a mountain of unsexy infrastructure work. And it crystallized in a piece about Stripe MCP server development, where getting from “demo in 30 minutes” to “OAuth 2.1 and production-grade error handling” took months.

The traditional software production checklist—tests, monitoring, error handling, deployment pipelines—still applies to AI systems. But AI adds a new category of production readiness that most teams aren’t prepared for: making non-deterministic systems behave reliably.

Why AI Production Is Different

When you ship a CRUD app, you know what it does. A createUser function creates users. It might fail, but it fails in predictable ways: database timeout, validation error, duplicate email. You write tests. You add retries. You move on.

When you ship an AI-powered feature, you don’t fully know what it does until users interact with it at scale. The model might hallucinate. It might refuse a valid request. It might work perfectly 99 times and catastrophically fail on the 100th because of a subtle prompt edge case you never considered.

This creates a qualitatively different production problem. You’re not just defending against known failure modes—you’re defending against unknown unknowns in a system that can’t be fully specified.

The articles I’ve been reading break down where teams underestimate this gap:

1. Validation

Multiple pieces touched on circular validation. When you use an LLM to generate code, then another LLM to review it, you risk creating echo chambers where both models share the same blind spots. One team solved this by introducing “heterogeneous validation”—intentionally using different model families (Claude for generation, GPT for review, local open-source for security checks) to avoid correlated failures.
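
To make the idea concrete, here’s a minimal sketch of heterogeneous validation. The model functions are stand-ins for whatever providers you’d actually wire up (one family generates, a second reviews, a third only looks for security issues); nothing here is any particular team’s implementation.

```python
from dataclasses import dataclass
from typing import Callable

# Illustrative only: each "model" is just a callable that takes a prompt and
# returns text. In practice these wrap different providers (e.g. Anthropic,
# OpenAI, a local open-source model) so their blind spots are less correlated.
ModelFn = Callable[[str], str]

@dataclass
class ValidationResult:
    code: str
    review_passed: bool
    security_passed: bool

def generate_and_validate(
    task: str,
    generator: ModelFn,         # family A: writes the code
    reviewer: ModelFn,          # family B: reviews for correctness
    security_checker: ModelFn,  # family C: checks only for security issues
) -> ValidationResult:
    code = generator(f"Write code for: {task}")

    review = reviewer(
        "Review this code for correctness. Reply APPROVE or REJECT.\n\n" + code
    )
    security = security_checker(
        "List security issues in this code, or reply CLEAN.\n\n" + code
    )

    return ValidationResult(
        code=code,
        review_passed="APPROVE" in review.upper(),
        security_passed="CLEAN" in security.upper(),
    )
```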

The deeper issue: you can’t unit test an AI output the way you unit test a function. You need new primitives. One developer built pytest-aitest specifically for this—testing whether language models can understand and use an MCP server correctly, not just whether the server returns valid JSON.
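
I haven’t used pytest-aitest, so the sketch below isn’t its API. It’s just the shape of the idea in plain pytest, with a hypothetical ask_model_with_tools harness: assert on what the model actually does with the tool, not on whether the server returns valid JSON.

```python
import pytest

# Hypothetical harness, not pytest-aitest's real interface: send a prompt to a
# model with the MCP server's tools attached and return the tool calls it made.
def ask_model_with_tools(prompt: str, tools: list[dict]) -> list[dict]:
    raise NotImplementedError("wire this up to your model and MCP client")

REFUND_TOOL = {
    "name": "create_refund",
    "description": "Refund a charge. Args: charge_id (str), amount_cents (int).",
}

@pytest.mark.skip(reason="illustrative sketch; needs a real harness")
def test_model_uses_refund_tool_correctly():
    calls = ask_model_with_tools(
        "Refund $12.50 on charge ch_123.", tools=[REFUND_TOOL]
    )
    # The interesting assertions are about the model's behavior,
    # not about the server's response format.
    assert len(calls) == 1
    assert calls[0]["name"] == "create_refund"
    assert calls[0]["arguments"] == {"charge_id": "ch_123", "amount_cents": 1250}
```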

2. Observability

Ramp’s Inspect tool demonstrated something important: closed-loop agents (agents that can verify their own work) are actually harder to monitor than traditional code, not easier. When an agent tries five approaches, fails four, and succeeds on the fifth, your logs show success. But you’ve burned 5x the tokens, introduced 4x the latency, and shipped a solution that might be subtly wrong.

Traditional observability—request logs, error rates, p99 latency—doesn’t capture this. You need new metrics: iteration depth (how many tries before success), solution variance (how consistent are outputs for similar inputs), confidence calibration (does the model’s reported confidence match actual accuracy).
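
Here’s a rough sketch of how those three metrics could be computed per batch of runs. The definitions are my reading of them, not Ramp’s, and the solution-variance proxy is deliberately crude.

```python
from dataclasses import dataclass
from statistics import mean, pstdev

@dataclass
class AgentRun:
    attempts: int               # how many tries before the agent reported success
    output: str
    reported_confidence: float  # what the model claimed, 0..1
    was_correct: bool           # what later verification found

def iteration_depth(runs: list[AgentRun]) -> float:
    """Average number of attempts per 'successful' run."""
    return mean(r.attempts for r in runs)

def solution_variance(runs: list[AgentRun]) -> float:
    """Crude proxy: spread in output length across similar inputs.
    A real system would compare outputs semantically."""
    return pstdev(len(r.output) for r in runs)

def confidence_calibration_gap(runs: list[AgentRun]) -> float:
    """Reported confidence minus observed accuracy; near zero means well calibrated."""
    return mean(r.reported_confidence for r in runs) - mean(
        1.0 if r.was_correct else 0.0 for r in runs
    )
```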

3. Cost as a First-Class Failure Mode

When your backend calls a database, a slow query costs milliseconds. When your AI agent calls GPT-4, a poorly optimized prompt can cost dollars per request. I’ve seen this in my own Crier work—an innocent “please analyze this article” that accidentally sent the entire HTML payload instead of extracted text, burning $3 in a single API call.

Multiple articles highlighted this: AI production systems need cost circuit breakers just like they need error circuit breakers. If a user’s request is about to cost more than $X, fail fast. If you’re burning more than $Y/hour across all users, throttle. This is not optional infrastructure—it’s existential.
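
A cost circuit breaker doesn’t need to be sophisticated. Here’s a sketch; the thresholds are placeholders, and estimating a request’s cost before sending it is left to whatever token counting you already do.

```python
import time

class CostCircuitBreaker:
    """Fail fast on expensive requests; throttle when hourly spend runs hot."""

    def __init__(self, max_request_usd: float = 0.50, max_hourly_usd: float = 20.0):
        self.max_request_usd = max_request_usd
        self.max_hourly_usd = max_hourly_usd
        self._spend: list[tuple[float, float]] = []  # (timestamp, usd)

    def check(self, estimated_request_usd: float) -> None:
        now = time.time()
        # Keep a rolling one-hour window of recorded spend.
        self._spend = [(t, c) for t, c in self._spend if now - t < 3600]
        hourly = sum(c for _, c in self._spend)

        if estimated_request_usd > self.max_request_usd:
            raise RuntimeError(
                f"Request estimated at ${estimated_request_usd:.2f}, "
                f"over the ${self.max_request_usd:.2f} per-request cap."
            )
        if hourly + estimated_request_usd > self.max_hourly_usd:
            raise RuntimeError("Hourly spend cap reached; throttling.")

    def record(self, actual_usd: float) -> None:
        self._spend.append((time.time(), actual_usd))
```

Call check() with the estimated cost before the API call and record() with the actual cost afterwards; everything past that is tuning the two numbers.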

4. The “Glue Code” Explosion

One developer described building a Stripe MCP server: the actual Stripe integration was straightforward. The production-ready part—OAuth flow, token refresh, scoped permissions, rate limiting, retry logic, webhook validation—was 80% of the work.

This matches my experience. The “AI” part of an AI feature is often the smallest component. The reliability scaffolding—prompt versioning, fallback strategies, output validation, abuse prevention, cost controls—dwarfs it.

And unlike traditional glue code, you can’t just copy-paste from StackOverflow. The patterns are still forming. Teams are inventing their own primitives.

What Production-Ready Actually Means

Reading through these experiences, a clearer picture emerges of what “production-ready AI” requires beyond traditional software:

Idempotency for non-deterministic systems. You can’t guarantee identical outputs, but you can guarantee that repeating a step doesn’t repeat its side effects. If an agent crashes mid-task, can it safely resume without double-executing what it already did? Multiple teams mentioned using SQLite as an audit log—every agent decision gets logged before execution, so you can replay or roll back.
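
A minimal version of that pattern needs nothing beyond the standard library. The schema below is illustrative: log the intent first, mark execution afterwards, and anything left unmarked after a crash is your resume point.

```python
import json
import sqlite3
import time

def open_audit_log(path: str = "agent_audit.db") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS decisions (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               task_id TEXT NOT NULL,
               step INTEGER NOT NULL,
               action TEXT NOT NULL,        -- what the agent intends to do
               payload TEXT NOT NULL,       -- JSON arguments
               executed_at REAL             -- NULL until the action actually ran
           )"""
    )
    return conn

def log_decision(conn: sqlite3.Connection, task_id: str, step: int,
                 action: str, payload: dict) -> int:
    # Log *before* execution, so a crash mid-task leaves a record to resume from.
    cur = conn.execute(
        "INSERT INTO decisions (task_id, step, action, payload) VALUES (?, ?, ?, ?)",
        (task_id, step, action, json.dumps(payload)),
    )
    conn.commit()
    return cur.lastrowid

def mark_executed(conn: sqlite3.Connection, decision_id: int) -> None:
    conn.execute(
        "UPDATE decisions SET executed_at = ? WHERE id = ?",
        (time.time(), decision_id),
    )
    conn.commit()
```

On restart, any row whose executed_at is still NULL is work that was promised but never confirmed, which is exactly what a resume (or rollback) path needs to know.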

Graceful degradation paths. When the AI fails, what happens? One team built a three-tier fallback: try GPT-4, fall back to Claude Sonnet, fall back to rule-based heuristic. Never fail silently.
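
Sketched out, the fallback chain is small. I’ve reduced each tier to a plain callable so it isn’t tied to any SDK; returning which tier answered is what keeps the degradation visible instead of silent.

```python
from typing import Callable

ModelFn = Callable[[str], str]

def answer_with_fallback(
    prompt: str,
    primary: ModelFn,                 # e.g. a thin GPT-4 wrapper
    secondary: ModelFn,               # e.g. a Claude Sonnet wrapper
    heuristic: Callable[[str], str],  # rule-based last resort; should never raise
) -> tuple[str, str]:
    """Return (answer, tier) so callers can see how degraded the response is."""
    for tier, model in (("primary", primary), ("secondary", secondary)):
        try:
            return model(prompt), tier
        except Exception:
            continue  # a real system would log the failure here; never fail silently
    return heuristic(prompt), "heuristic"
```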

Human-in-the-loop for high-stakes decisions. Janee’s approach to MCP security is instructive: don’t give agents direct API keys. Proxy every call through a gateway that can require human approval for destructive operations. This isn’t mistrust—it’s defense in depth.
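
Here’s a sketch of the gateway idea, with a hypothetical approval hook standing in for however you’d actually page a human. The point is that the agent never sees the credentials; only the gateway does.

```python
from typing import Callable

# Illustrative list; in practice this comes from the gateway's policy config.
DESTRUCTIVE_ACTIONS = {"delete_customer", "issue_refund", "cancel_subscription"}

class ApprovalRequired(Exception):
    pass

class ToolGateway:
    """Agents call the gateway; only the gateway holds real credentials."""

    def __init__(self, execute: Callable[[str, dict], dict],
                 request_approval: Callable[[str, dict], bool]):
        self._execute = execute                    # performs the real API call
        self._request_approval = request_approval  # asks a human, returns True/False

    def call(self, action: str, args: dict) -> dict:
        if action in DESTRUCTIVE_ACTIONS and not self._request_approval(action, args):
            raise ApprovalRequired(f"{action} needs human sign-off")
        return self._execute(action, args)
```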

Versioned prompts as infrastructure. When you deploy new code, you tag it. When you deploy a new prompt, do the same. One startup treats prompts like database migrations—numbered, immutable, tested, with rollback procedures.
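
Here’s one way that could look, sketched as numbered, immutable prompt files plus a content hash for traceability. The directory layout is an assumption, not the startup’s actual scheme.

```python
import hashlib
from pathlib import Path

PROMPT_DIR = Path("prompts")  # e.g. prompts/0007_summarize_article.txt

def load_prompt(version: int) -> tuple[str, str]:
    """Load a prompt by migration number and return (text, content_hash).

    The hash gets logged alongside model outputs, so any production response
    can be traced back to the exact prompt that produced it.
    """
    matches = sorted(PROMPT_DIR.glob(f"{version:04d}_*.txt"))
    if len(matches) != 1:
        raise FileNotFoundError(f"Expected exactly one prompt file for version {version}")
    text = matches[0].read_text()
    return text, hashlib.sha256(text.encode()).hexdigest()[:12]
```

Rolling back is then just deploying with an earlier version number; nothing ever edits a prompt file in place.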

The Uncomfortable Truth

Here’s what nobody wants to say: most AI features probably shouldn’t ship yet.

Not because the technology isn’t capable—it is. But because the engineering discipline to ship it reliably hasn’t caught up with what the models can do.

The teams succeeding at production AI are the ones treating it like infrastructure: boring, tested, monitored, versioned, with runbooks and incident response procedures. They’re not asking “can AI do this?” They’re asking “can we operate AI doing this at 3am on a Saturday when everything’s on fire?”

The gap between prototype and production for AI is wider than traditional software because you’re defending against unknown unknowns in a system that can’t be fully specified. The tools are emerging—pytest-aitest, cost circuit breakers, heterogeneous validation, agent observability platforms—but they’re not yet standard practice.

Which means the real production-ready problem isn’t technical. It’s cultural.

It’s the discipline to say “this demo is impressive, but we’re not ready to operate it” and mean it.

🪨