
LLM Engineering in Production: What They Don't Tell You

Daisy Huang · 3 min read

Deploying a demo that uses GPT-4 is easy. Deploying a production LLM system that a clinical team trusts with patient data is something else entirely.

Over the past two years I've been building AI-assisted tooling inside healthcare — clinical documentation assistants, prior auth automation, patient triage support. Each of these has taught me something the blog posts don't mention.

Prompt Drift Is Real and Gradual

The model you're prompting today is not the model you were prompting six months ago. OpenAI, Anthropic, and Google all push silent updates to their hosted models. Your evals may still pass, but user-facing behavior will shift. I've seen carefully tuned clinical summarisation prompts start producing subtly different output structures after a model update — nothing catastrophic, but enough to break downstream parsing.

What to do: Pin model versions wherever your vendor allows. Write evals against outputs, not just against a rubric. Snapshot representative outputs monthly and diff them.
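
Here's a minimal sketch of that snapshot-and-diff loop, assuming outputs live as one JSON file per month keyed by prompt ID (the file layout, function names, and month strings are illustrative, not a specific tool):

```python
import json
import difflib
from pathlib import Path

SNAPSHOT_DIR = Path("eval_snapshots")  # illustrative layout: one JSON file per month

def save_snapshot(month: str, outputs: dict[str, str]) -> None:
    """Store this month's model outputs, keyed by prompt ID."""
    SNAPSHOT_DIR.mkdir(exist_ok=True)
    (SNAPSHOT_DIR / f"{month}.json").write_text(json.dumps(outputs, indent=2))

def diff_snapshots(old_month: str, new_month: str) -> dict[str, str]:
    """Return a unified diff for every prompt whose output changed between snapshots."""
    old = json.loads((SNAPSHOT_DIR / f"{old_month}.json").read_text())
    new = json.loads((SNAPSHOT_DIR / f"{new_month}.json").read_text())
    drifted = {}
    for prompt_id, old_output in old.items():
        new_output = new.get(prompt_id, "")
        if new_output != old_output:
            drifted[prompt_id] = "\n".join(
                difflib.unified_diff(
                    old_output.splitlines(),
                    new_output.splitlines(),
                    fromfile=old_month,
                    tofile=new_month,
                    lineterm="",
                )
            )
    return drifted

if __name__ == "__main__":
    # Example run with placeholder months: flag any prompt whose output shifted.
    for prompt_id, diff in diff_snapshots("2024-05", "2024-06").items():
        print(f"--- drift detected for {prompt_id} ---\n{diff}\n")
```

The point isn't the tooling; it's that drift becomes a diff you can read, instead of a vague sense that "the summaries feel different lately."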

Latency Is a UX Problem, Not an Infrastructure Problem

Teams obsess over throughput and P99 latency for their APIs. But for clinicians using an AI assistant mid-consultation, what matters is perceived latency. A 4-second first-token delay feels like an eternity when a patient is in the room.

Techniques that actually help:

  • Streaming responses (even a small trickle gives the feeling of responsiveness; see the sketch after this list)
  • Aggressively cache anything deterministic — patient summaries, static lookups
  • Move LLM calls that don't need to block the user off the critical path
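
To make the streaming point concrete, here's a rough sketch using the OpenAI Python SDK's streaming interface. The model name and prompts are placeholders, and the same pattern exists in other vendors' SDKs:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def stream_summary(note_text: str) -> str:
    """Stream a summary fragment-by-fragment so the clinician sees output immediately."""
    stream = client.chat.completions.create(
        model="gpt-4o",  # placeholder; pin the exact version you have validated
        messages=[
            {"role": "system", "content": "Summarise the clinical note concisely."},
            {"role": "user", "content": note_text},
        ],
        stream=True,
    )
    pieces = []
    for chunk in stream:
        if not chunk.choices:
            continue  # skip chunks that carry no content
        delta = chunk.choices[0].delta.content
        if delta:
            print(delta, end="", flush=True)  # render each fragment as it arrives
            pieces.append(delta)
    print()
    return "".join(pieces)
```

Total generation time barely changes, but the first visible token arrives in well under a second, and that is what the person in the room actually experiences.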

Hallucination Is a Calibration Problem

"Zero hallucination" is not a real target — it's a marketing line. The real goal is calibrated uncertainty. Your system should know when it doesn't know.

In practice this means:

  • Building explicit retrieval steps (RAG) over verified sources rather than relying on parametric knowledge (a sketch follows this list)
  • Designing UI that communicates confidence levels
  • Red-teaming your prompts with adversarial inputs before launch, not after
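
Here's a rough sketch of that retrieve-then-answer-or-abstain shape. The retrieve and generate callables and the similarity threshold are stand-ins for your own vector store, LLM call, and calibration data:

```python
from dataclasses import dataclass

@dataclass
class RetrievedChunk:
    text: str
    source: str
    score: float  # similarity score from your vector store, higher is better

# Threshold below which the system should say "I don't know" rather than guess.
# The value is illustrative; calibrate it on held-out queries from your own evals.
MIN_RETRIEVAL_SCORE = 0.75

def answer_with_sources(question: str, retrieve, generate) -> dict:
    """Answer only when retrieval over verified sources is strong enough.

    `retrieve` and `generate` are stand-ins for a vector-store query and an
    LLM call; neither is a specific library API.
    """
    chunks: list[RetrievedChunk] = retrieve(question, top_k=5)
    strong = [c for c in chunks if c.score >= MIN_RETRIEVAL_SCORE]
    if not strong:
        # Calibrated abstention: surface uncertainty instead of a fabricated answer.
        return {"answer": None, "confidence": "low", "sources": []}
    context = "\n\n".join(c.text for c in strong)
    answer = generate(question=question, context=context)
    return {
        "answer": answer,
        "confidence": "high" if len(strong) >= 3 else "medium",
        "sources": [c.source for c in strong],
    }
```

The abstention branch is the whole point: a system that can return "low confidence, no answer" is far more trustworthy in a clinical setting than one that always produces something.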

Evaluation Is the Actual Hard Part

You can build an LLM pipeline in a weekend. A reliable evaluation harness takes months. Without evals, you're flying blind on quality regressions.

What we've found effective:

  • LLM-as-judge for nuanced quality (surprisingly robust when the judge prompt is well-designed)
  • Deterministic unit tests for format, length, and required fields (sketched after this list)
  • Human spot-check rotations for anything patient-facing
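
For the deterministic layer, here's a small pytest-style sketch that checks format, length, and required fields on a structured summary; the field names and limits are invented for illustration:

```python
import json
import pytest

# Hypothetical required fields for a structured clinical-summary output.
REQUIRED_FIELDS = {"chief_complaint", "assessment", "plan"}
MAX_SUMMARY_CHARS = 2000

def validate_summary(raw_output: str) -> dict:
    """Deterministic checks: valid JSON, required fields present, length bounded."""
    data = json.loads(raw_output)  # raises if the model broke the JSON contract
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        raise ValueError(f"missing required fields: {sorted(missing)}")
    if len(raw_output) > MAX_SUMMARY_CHARS:
        raise ValueError(f"output exceeds {MAX_SUMMARY_CHARS} characters")
    return data

def test_valid_output_passes():
    raw = json.dumps({"chief_complaint": "cough", "assessment": "viral URI", "plan": "rest"})
    assert validate_summary(raw)["plan"] == "rest"

def test_missing_field_fails():
    raw = json.dumps({"chief_complaint": "cough", "assessment": "viral URI"})
    with pytest.raises(ValueError):
        validate_summary(raw)
```

These cheap checks catch a surprising share of regressions before the expensive LLM-as-judge or human review layers ever see the output.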

Closing Thoughts

Production LLM engineering is mostly systems engineering with a layer of uncertainty on top. The uncertainty is manageable — but only if you build the infrastructure to observe, measure, and respond to it. Start with evals, not with the model.
