
What breaks when you ship AI in production

January 28, 2026 · by Akarsh

Shipping AI in production is very different from experimenting with AI in a demo or side project.

I have worked on and shipped multiple AI systems that reached real users and real traffic. The hard part was never calling an AI API and getting a response back. That part is easy. The real challenges start when AI becomes a core part of your production workflow.

Once AI requests become large, slow, and asynchronous, you stop building features and start running a distributed system.

Failure modes I did not expect

The first production AI system I shipped did not fail loudly. It failed quietly.

User clicks "Generate"
→ API enqueues background job
→ API returns 200 OK

Everything meaningful happened after the HTTP request had already completed.

In one case, a background job crashed after a partial model response. No exception bubbled up. No retry triggered. The user waited indefinitely for a WebSocket message that never arrived.

In another case, a model returned valid JSON that passed schema validation but was semantically wrong. Downstream code treated it as success, and corrupted state propagated silently.

Schema validation can ensure structure, but it cannot tell you whether the content is meaningfully correct. That distinction only became obvious after things broke.
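
As a rough illustration, a response can satisfy a schema while still being unusable, so one mitigation is a lightweight semantic check layered on top of structural validation. The checks and thresholds below are hypothetical, not from a real system:

// Sketch: schema validation alone accepts this, a semantic check rejects it.
// The specific rules are illustrative only.
function isSemanticallyValid(result) {
  if (!result || typeof result.summary !== "string") return false // structural
  if (result.summary.trim().length === 0) return false            // empty but schema-valid
  if (result.summary.length < 20) return false                    // too short to be a real summary
  if (result.summary === result.title) return false               // model just echoed the title
  return true
}

// Passes a { title, summary } schema, fails the semantic check:
isSemanticallyValid({ title: "Q3 report", summary: "Q3 report" }) // false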

Provider outages and the illusion of reliability

Early on, I assumed model providers would be “mostly up” and that occasional errors were acceptable.

In reality, providers degrade in subtle ways. Latency spikes. Requests start timing out. Partial responses appear. Entire regions become unavailable without a clear outage signal.

When AI is deeply embedded in your workflow, a single degraded provider can stall the entire application. Without a strategy for handling provider failure, the app effectively goes dark.
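
A minimal version of such a strategy is a hard timeout plus a fallback provider. This is only a sketch; callProvider and the provider names are placeholders, not a specific SDK:

// Sketch: hard timeout plus fallback provider.
// callProvider and the provider names are hypothetical placeholders.
async function callWithFallback(input) {
  const providers = ["primary", "secondary"]
  for (const provider of providers) {
    try {
      return await withTimeout(callProvider(provider, input), 30_000)
    } catch (err) {
      console.warn(`provider ${provider} failed:`, err.message)
    }
  }
  throw new Error("all providers failed or timed out")
}

function withTimeout(promise, ms) {
  return Promise.race([
    promise,
    new Promise((_, reject) =>
      setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms)
    ),
  ])
}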

Why async AI failures go silent

Most AI integrations are written as if the system is synchronous, even when it is not.

POST /generate
  enqueue(job)
  return 200

Everything important happens after the response is sent. If a job stalls, times out, or crashes, the user-facing request has no way to reflect that unless state is explicitly tracked.

Retries often made this worse. A retry might eventually succeed, masking the original failure and making root-cause analysis difficult.
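
One way to keep retries from hiding the original failure is to record every attempt before retrying. A sketch, where recordAttempt is a hypothetical persistence helper:

// Sketch: retry, but persist each attempt's error so the first failure
// is never lost. recordAttempt is a hypothetical helper.
async function withRetries(jobId, fn, maxAttempts = 3) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await fn()
    } catch (err) {
      await recordAttempt(jobId, attempt, err.message)
      if (attempt === maxAttempts) throw err
    }
  }
}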

Before and after: making async state explicit

One of the biggest improvements came from making async state transitions explicit instead of implicit.

// BEFORE: implicit state
async function generate(jobId, input) {
  const result = await callAIModel(input)
  await saveResult(jobId, result)
  await notifyClient(jobId, result)
}

// AFTER: explicit state tracking
async function generate(jobId, input) {
  await updateState(jobId, "started")

  try {
    await updateState(jobId, "model_call_started")
    const result = await callAIModel(input)

    await updateState(jobId, "model_call_succeeded")
    await saveResult(jobId, result)

    await updateState(jobId, "completed")
    await notifyClient(jobId, result)
  } catch (err) {
    await updateState(jobId, "failed", { error: err.message })
    throw err
  }
}

Once state transitions were persisted, failures became boring. It was always clear where a request stopped progressing.

Cost explosions and missing rate limits

One painful lesson was how quickly costs can spiral when AI endpoints are exposed without strict limits.

A single misbehaving client or retry loop can generate thousands of requests in minutes. Because AI usage maps directly to cost, small bugs can turn into billing incidents overnight.

Treating rate limiting and quotas as optional is a mistake that only has to happen once.
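
Even a crude per-user quota in front of the AI endpoint catches most of this. The sketch below uses an in-memory counter purely for illustration; a real limiter would live in Redis, the API gateway, or the queue itself:

// Sketch: naive in-memory per-user rate limit, illustration only.
const windowMs = 60_000
const maxRequestsPerWindow = 20
const counters = new Map() // userId -> { count, windowStart }

function allowRequest(userId) {
  const now = Date.now()
  const entry = counters.get(userId)
  if (!entry || now - entry.windowStart > windowMs) {
    counters.set(userId, { count: 1, windowStart: now })
    return true
  }
  if (entry.count >= maxRequestsPerWindow) return false
  entry.count++
  return true
}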

Debugging without observability is guesswork

When something went wrong, the first question was always the same: “Is the model slow, or is the system slow?”

In one system, roughly 70% of perceived model slowness turned out to be job backlog and coordination delay, not model execution.

Without end-to-end request timelines, state visibility, and error context, debugging becomes speculation instead of diagnosis.
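
Recording a few timestamps per job is often enough to answer that question. A sketch, assuming a job record with an enqueued-at field and a hypothetical saveTimings helper:

// Sketch: timestamp the phases of a job so queue delay and model time
// can be separated later. Field names are illustrative.
async function processJob(job) {
  const pickedUpAt = Date.now()
  const queueDelayMs = pickedUpAt - job.enqueuedAt

  const modelStart = Date.now()
  const result = await callAIModel(job.input)
  const modelMs = Date.now() - modelStart

  await saveTimings(job.id, { queueDelayMs, modelMs }) // hypothetical helper
  return result
}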

Integration testing was burning money

During early development, most failures had nothing to do with model quality. Prompts changed. Response structures evolved. Parsers broke. State handling was wrong.

Yet every failed attempt still resulted in a real API call and real cost.

// Deterministic stand-in for callAIModel during integration testing
function mockModelCall(input) {
  return {
    status: "completed",
    output: {
      title: "Sample title",
      summary: "Sample summary"
    }
  }
}
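
Swapping the mock in can be as simple as an environment check. The flag name here is made up; the exact wiring will differ per codebase:

// One possible wiring: use the mock outside of production.
// USE_MOCK_MODEL is an illustrative flag, not from a specific setup.
const modelCall = process.env.USE_MOCK_MODEL === "true"
  ? mockModelCall
  : callAIModel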

Separating workflow correctness from model correctness made iteration faster and dramatically cheaper.

Structured output prevents entire classes of bugs

One recurring source of breakage was unstructured or partially structured model output.

Ad-hoc parsing, regex extraction, and best-effort assumptions worked in demos, but failed unpredictably in production.

Enforcing a strict output structure eliminated whole categories of downstream bugs that were otherwise hard to detect.
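
A minimal version of that enforcement is a parse step that either returns a validated object or throws. The expected fields below mirror the title/summary shape used earlier and are illustrative:

// Sketch: parse model output into a strict shape or fail loudly.
function parseModelOutput(raw) {
  let data
  try {
    data = JSON.parse(raw)
  } catch {
    throw new Error("model output is not valid JSON")
  }
  if (typeof data.title !== "string" || typeof data.summary !== "string") {
    throw new Error("model output is missing required fields")
  }
  return { title: data.title, summary: data.summary }
}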

Local development was lying to me

Locally, jobs ran instantly. WebSockets never disconnected. Timeouts never triggered.

The first real failures only appeared after deployment. Being able to observe async execution locally changed how early problems were caught.
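
One way to make local runs less flattering is to inject artificial latency and failures in development. A sketch, with a made-up flag name:

// Sketch: wrap the model call with artificial latency and random failures
// in development. CHAOS_MODE is an illustrative flag.
async function callAIModelWithChaos(input) {
  if (process.env.CHAOS_MODE === "true") {
    await new Promise((resolve) => setTimeout(resolve, 2000 + Math.random() * 8000))
    if (Math.random() < 0.1) throw new Error("injected provider failure")
  }
  return callAIModel(input)
}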

What I wish I had known earlier

Once I started treating AI workflows as distributed systems first and model calls second, failures became easier to reason about and far less surprising.

Closing thoughts

Developers should not have to rediscover these lessons the hard way just to ship a production-grade AI application.

I’m currently working on ModelRiver, which grew out of repeatedly hitting these exact problems while shipping AI systems in production.

Thanks for reading. You can find me on X/@akarshcp.