Why Observability Comes Before Optimization | Niranjan Sah

The Optimization Trap

The conversation usually starts the same way. An engineer notices something in production - a slow endpoint, a high-latency request, a service consuming more memory than expected. They have a theory about why. They have an idea for a fix.

"Let's optimize the retrieval layer," someone says. "The database queries are the bottleneck." Or: "We should add caching here." Or: "The model is slow. We need a faster model."

These are reasonable hypotheses. They're also, in my experience, wrong roughly half the time.

I've watched engineers - including myself - spend two weeks building an optimization that addressed a problem that didn't exist in the way we thought. The cache we added was in the wrong layer. The query we optimized wasn't the slow one. The model we blamed for latency was sitting idle most of the time while the retrieval layer fed it data at a third of the speed it could handle.

The pattern was consistent: we optimized before we measured. We had a theory. We skipped the part where you find out whether the theory matches reality.

This is the optimization trap. It doesn't look like negligence. It looks like productive engineering. But you're deciding what to solve before you know what the problem actually is.

What Observability Actually Means

Observability is not logging. Logging is a component of observability - necessary but not sufficient. A system with rich logs and no way to query them, correlate them, or understand the shape of the data is not observable. It's just verbose.

Observability is the property of a system that lets you understand its internal state from its external outputs. More concretely: it's the combination of four things working together.

Metrics are numerical measurements aggregated over time. They're what you use to track whether something is healthy. Latency percentiles, error rates, throughput, resource utilization. Metrics answer: is this working, and how well?

Metrics have a cost: aggregation. When you compute a latency average, you lose the distribution. When you track error count, you lose error type. This is why metrics alone aren't enough.
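To make the cost of aggregation concrete, here is a minimal sketch using made-up latency samples. The `percentile` helper is a hypothetical nearest-rank implementation written for this example, not the method of any particular metrics system, which will often use histograms or sketches instead.

```python
import math
import statistics


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that
    at least p percent of all samples are at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up data: 98 requests at 100 ms and 2 requests at 1000 ms.
latencies_ms = [100] * 98 + [1000] * 2

mean_ms = statistics.mean(latencies_ms)
p50, p95, p99 = (percentile(latencies_ms, p) for p in (50, 95, 99))
print(f"mean={mean_ms} p50={p50} p95={p95} p99={p99}")
```

With this data the mean is 118 ms, which looks healthy, while the p99 is 1000 ms. The average alone gives no hint that two requests in a hundred were ten times slower than the rest.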

Logs are individual event records. They're what you use when you need to understand a specific request or a specific failure in detail. Logs answer: what happened on this specific path?

The problem with logs is volume and structure. Without a consistent schema and a way to query them efficiently, logs become archaeology - you have to dig through them to find anything.

Traces are records of a single request as it moves through a system. They're what you use to understand latency composition and dependency relationships. A trace shows you: this request hit the API gateway at time T, the call to the retrieval service started at T+5ms and returned at T+45ms, then the inference layer took 200ms. Without traces, you don't know which step is slow. With traces, you can see the full picture.

Traces are the most underused of the four in my experience. Engineers who haven't worked in systems with distributed tracing often don't appreciate how much latency attribution changes when you can see where time actually goes.
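The idea behind a trace can be sketched in a few lines. This is a toy, in-process illustration only: the `Trace` class, the `handle_request` function, and the stage names are all hypothetical, and the sleeps stand in for real work. A real system would more likely use a distributed tracing library such as OpenTelemetry, which also propagates context across service boundaries.

```python
import time
from contextlib import contextmanager


class Trace:
    """Collects (name, duration_ms) spans for a single request."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock  # injectable so the timing source can be swapped
        self.spans = []

    @contextmanager
    def span(self, name):
        start = self._clock()
        try:
            yield
        finally:
            # Record the span even if the wrapped step raises.
            elapsed_ms = (self._clock() - start) * 1000.0
            self.spans.append((name, elapsed_ms))

    def slowest(self):
        """Return the (name, duration_ms) pair with the longest duration."""
        return max(self.spans, key=lambda s: s[1])


def handle_request(trace):
    # Hypothetical stages; the sleeps are placeholders for real calls.
    with trace.span("retrieval"):
        time.sleep(0.02)
    with trace.span("inference"):
        time.sleep(0.005)


trace = Trace()
handle_request(trace)
for name, ms in trace.spans:
    print(f"{name}: {ms:.1f} ms")
print("slowest step:", trace.slowest()[0])
```

The point of the design is that every step is wrapped the same way, so the question "where did the time go?" is answered by the data rather than by a theory.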

Dashboards and alerting are the presentation and action layers. Dashboards let you see system health at a glance. Alerting lets you know when to care.

The four components are complementary. Metrics tell you something is wrong. Logs tell you what happened. Traces tell you where it went wrong. Dashboards and alerting show you the current state and tell you when to care.

Problems I Couldn't Understand Until I Had Visibility

Latency that wasn't where I expected

I worked on a service that was serving AI inference requests. The API had a latency target - something reasonable, in the low hundreds of milliseconds. Production was missing it consistently.

The prevailing theory was the inference layer. Large language models are slow. That must be where the time was going.

I added tracing to the request path. What I found: inference was taking 80 milliseconds. The retrieval step - pulling the context documents and formatting the prompt - was taking 220 milliseconds. We had optimized inference because we thought inference was the problem. The actual bottleneck was a synchronous retrieval call that was fetching documents one at a time instead of in parallel.

Once I could see the latency composition, the fix was obvious - batch the retrieval calls so the documents are fetched together rather than one at a time. We hadn't changed the model. We hadn't changed the inference infrastructure. We had fixed the actual problem, which we'd have seen immediately if we'd measured first.
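As a rough illustration of why fetching documents one at a time hurts, here is a sketch that compares sequential and concurrent fetches. Everything in it is an assumption made for the example: `fetch_document` is a hypothetical stand-in that just sleeps for 50 ms, and the real service may have batched calls differently than `asyncio.gather` does here.

```python
import asyncio
import time


async def fetch_document(doc_id, delay_s=0.05):
    # Hypothetical stand-in for a network call to a document store.
    await asyncio.sleep(delay_s)
    return f"doc-{doc_id}"


async def fetch_sequential(doc_ids):
    # One call at a time: total latency is the sum of all calls.
    return [await fetch_document(d) for d in doc_ids]


async def fetch_concurrent(doc_ids):
    # All calls in flight together: total latency is roughly the slowest call.
    return list(await asyncio.gather(*(fetch_document(d) for d in doc_ids)))


async def timed(coro):
    start = time.perf_counter()
    result = await coro
    return result, time.perf_counter() - start


async def main():
    ids = [1, 2, 3, 4]
    seq_docs, seq_s = await timed(fetch_sequential(ids))
    con_docs, con_s = await timed(fetch_concurrent(ids))
    assert seq_docs == con_docs  # same documents, same order
    print(f"sequential: {seq_s * 1000:.0f} ms, concurrent: {con_s * 1000:.0f} ms")
    return seq_s, con_s


seq_s, con_s = asyncio.run(main())
```

With four documents at about 50 ms each, the sequential path costs roughly four times what the concurrent path does, and the gap grows with every document added to the context.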

A caching layer that wasn't working

We added a cache to an inference endpoint. Memory-backed, short TTL, keyed on the input prompt hash. The cache hit rate looked fine in testing - we were getting 60, 70 percent hit rates in staging.

Production was different. Cache hit rate was under 20 percent. Nobody had noticed because the cache was returning responses - they were just stale responses being returned for the wrong reasons.

I traced the cache path and found the problem: the retry logic in the inference client was retrying failed requests with a slightly different prompt format each time - adding a whitespace normalization that varied by client version. The cache key didn't account for this. Every retry was a cache miss. Every cache miss was a new inference request. We had a retry storm that looked like a cache problem from the metrics.
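One way to guard against this class of bug is to normalize the prompt before hashing it. The sketch below is an assumption about what such a fix could look like, not the change we actually shipped; `cache_key` is a hypothetical helper, and collapsing whitespace is only safe if whitespace carries no meaning for the model (it often does for code or preformatted text).

```python
import hashlib


def cache_key(prompt: str) -> str:
    # Collapse runs of whitespace and trim both ends, so that formatting
    # differences between client versions do not change the key.
    normalized = " ".join(prompt.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


first_attempt = cache_key("Summarize this  document.\n")
retry = cache_key("Summarize this document.")
print(first_attempt == retry)  # True
```

The more general lesson is to record cache hits and misses per key path, so a key that never repeats shows up in the data instead of hiding behind an aggregate hit rate.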

Resource usage that seemed anomalous

A service was consuming CPU in a pattern nobody could explain - a steady baseline that would spike to three times normal, for about ten minutes at a time, twice a day. The spikes didn't correlate with traffic. They didn't correlate with any scheduled job anyone knew about.

The initial theories: a memory leak causing GC pressure, a background worker misbehaving.

I instrumented the process with more granular metrics - per-thread CPU, garbage collection frequency, heap size over time. What I found: the spikes were the inference layer warming up after a model reload. The model would reload every twelve hours, and the warm-up period - loading weights into GPU memory, running initial inference to prime the GPU - was expensive. It happened regardless of traffic.

Why Optimization Without Visibility Fails

Wrong assumptions are the default. Every engineer who has worked on a performance problem has a theory about the root cause before they've measured anything. This is human. It's also the source of most wasted optimization work. The theory is often wrong because production systems are genuinely complex, and the behavior of a system under load is often counterintuitive.

Local optimization damages global performance. An engineer optimizes a single service to reduce its latency by 30 percent. The overall request latency barely changes, because the service was only 15 percent of the total path. Meanwhile, the optimization introduces a new failure mode. Optimization is an allocation problem as much as a performance problem.
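The arithmetic behind the 30 percent example is worth writing down. This sketch assumes the stages run one after another, so each stage's share of the path is its share of total latency; `overall_reduction` is a hypothetical helper name.

```python
def overall_reduction(share_of_path: float, local_reduction: float) -> float:
    """Fraction by which end-to-end latency drops when one stage,
    responsible for `share_of_path` of the total, gets faster by
    `local_reduction`. Assumes stages run sequentially."""
    return share_of_path * local_reduction


# A 30 percent improvement to a stage that is 15 percent of the path.
print(f"{overall_reduction(0.15, 0.30):.1%}")  # 4.5%
```

A 30 percent local win becomes a 4.5 percent end-to-end win. The same effort spent on a stage that is 70 percent of the path would have moved the total by 21 percent.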

Symptoms are not root causes. High memory usage is a symptom. The root cause might be expected growth under load, or a configuration that allocates too much for the workload. Error rate spikes are a symptom. The root cause might be a dependency failure, or a deployment that introduced a change. Without visibility into the chain of causes and effects, you're guessing.

Observability in AI Systems

AI systems have a specific observability challenge that conventional systems don't face: the inference step is opaque by nature, and the surrounding infrastructure is often under-instrumented because engineers assume the model is where all the interesting behavior happens.

It's usually not.

Request flows in AI systems are longer and more complex than in conventional services. A typical AI inference request might pass through: an API gateway, an authentication layer, a request validation step, a retrieval layer, a prompt construction step, an inference call, and a response parsing step. Each of these is a potential bottleneck. Only one of them is the model, but the model gets the blame because it's the expensive, obvious component.

Token usage is a metric that, in my experience, most AI platforms don't track at the per-request level. Total tokens per day, yes. Tokens per request, rarely. Without per-request visibility, you can't understand which requests are expensive, which users are driving costs, or whether cost is concentrated in a small number of requests.
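Per-request token accounting does not need much machinery. The sketch below assumes you can read prompt and completion token counts for each request from wherever your inference client reports them; the `TokenRecord` fields, the helper names, and the sample data are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenRecord:
    request_id: str
    user_id: str
    prompt_tokens: int
    completion_tokens: int

    @property
    def total(self) -> int:
        return self.prompt_tokens + self.completion_tokens


def tokens_by_user(records):
    """Total tokens per user: answers 'which users are driving costs?'"""
    totals = defaultdict(int)
    for r in records:
        totals[r.user_id] += r.total
    return dict(totals)


def share_of_top_requests(records, n):
    """Fraction of all tokens consumed by the n most expensive requests."""
    totals = sorted((r.total for r in records), reverse=True)
    return sum(totals[:n]) / sum(totals)


# Made-up records for illustration.
records = [
    TokenRecord("r1", "alice", 1200, 300),
    TokenRecord("r2", "bob", 100, 50),
    TokenRecord("r3", "alice", 200, 150),
]
print(tokens_by_user(records))
print(share_of_top_requests(records, 1))
```

In this made-up data a single request accounts for three quarters of all tokens, which is exactly the kind of concentration a daily total hides.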

Retrieval quality is almost never instrumented. In RAG-based systems, the quality of the retrieved context largely determines the quality of the model's output. If the retrieval layer is returning irrelevant documents, the model will produce lower-quality responses. But unless you're measuring retrieval precision, you have no visibility into this failure mode.
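Retrieval precision is simple to compute once you have relevance judgments, which is the hard part and is assumed here. The sketch below uses the common precision-at-k definition; the document ids and the labeled relevant set are made up for illustration.

```python
def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top-k retrieved documents that are relevant.
    Divides by k, so returning fewer than k documents is penalized."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant_ids)
    return hits / k


# Made-up example: ranked retrieval output and a labeled relevant set.
retrieved = ["d1", "d7", "d3", "d9"]
relevant = {"d1", "d3", "d4"}
print(precision_at_k(retrieved, relevant, 4))  # 0.5
```

Emitting this as a metric per query, even on a small labeled sample, turns "the answers feel worse lately" into something you can put on a dashboard.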

What I Would Instrument First

If I were instrumenting a new system from scratch, here's what I would prioritize, in order.

Step 1: Request tracing end-to-end, from day one. This is the highest-value investment and the one most likely to be deprioritized because it feels like overhead. Add distributed tracing to every service boundary before the system is complex enough to need it. Retrofitting tracing is painful; instrumenting from the start is cheap.

Step 2: Latency percentiles, not averages. Track p50, p95, p99. Averages hide the distribution. If p99 latency is ten times p50, you're serving some users very badly and you won't see it from the average.

Step 3: Error rates by type, not just count. A system with one error type at 5 percent is different from a system with five error types each at 1 percent. The fix is different. Track error types.

Step 4: Resource utilization per service. CPU and memory are the starting point. For AI systems, add GPU utilization, token throughput, and batch queue depth. These tell you whether you're underutilizing expensive hardware or pushing into saturation.
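For Step 3, a sketch of what "error rates by type" means in practice. It assumes each request outcome has already been classified into an error-type string, or `None` for success; the type names and counts are invented for the example.

```python
from collections import Counter


def error_rates_by_type(outcomes):
    """Map each error type to its share of all requests.
    `outcomes` holds an error-type string per request, or None on success."""
    outcomes = list(outcomes)
    total = len(outcomes)
    if total == 0:
        return {}
    counts = Counter(o for o in outcomes if o is not None)
    return {error_type: n / total for error_type, n in counts.items()}


# Made-up window of 100 requests: 95 successes, 3 timeouts, 2 validation errors.
outcomes = [None] * 95 + ["timeout"] * 3 + ["validation"] * 2
rates = error_rates_by_type(outcomes)
print(rates)
print(f"overall error rate: {sum(rates.values()):.2%}")
```

The overall rate here is 5 percent, but the breakdown says most of it is timeouts, which points at a dependency or capacity problem rather than at bad input.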

The rule I follow: if I can't answer the question "what's the slowest part of this request path right now?" in under a minute with the existing instrumentation, the instrumentation is insufficient.

Lessons Learned

Measure first, optimize second. I know this sounds obvious. I've watched it be violated enough times that it's worth saying. The optimization you build before measuring is usually wrong in at least one important way.

Correlate metrics before you act. Latency is up. Is it because inference is slow? Because retrieval is slow? Because the network is slow? Each has a different fix. Correlate across layers before you decide where to act.

Define SLOs before you need them. An SLO is a contract with your users about what "good enough" looks like. If you define it before you have a problem, you know when you've violated it. Latency SLO, availability SLO, error rate SLO - start with those three.
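Writing the three SLOs down as data makes them checkable. This is a minimal sketch: the `SLO` class, the metric names, and every threshold below are placeholder assumptions, not recommendations for any particular system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    name: str
    threshold: float
    higher_is_better: bool

    def is_met(self, measured: float) -> bool:
        if self.higher_is_better:
            return measured >= self.threshold
        return measured <= self.threshold


# Placeholder targets, chosen only for illustration.
SLOS = [
    SLO("p95_latency_ms", 300.0, higher_is_better=False),
    SLO("availability", 0.999, higher_is_better=True),
    SLO("error_rate", 0.01, higher_is_better=False),
]


def violations(measured):
    """Names of the SLOs that the measured values fail to meet."""
    return [slo.name for slo in SLOS if not slo.is_met(measured[slo.name])]


current = {"p95_latency_ms": 340.0, "availability": 0.9995, "error_rate": 0.004}
print(violations(current))  # ['p95_latency_ms']
```

Having the targets in one place before an incident means the question during the incident is "which SLO did we break?" rather than "what should the target have been?"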

Observability is not a feature. It's infrastructure. It doesn't arrive for free with the product; someone has to build it. Treating it as optional means you find out it's insufficient when you most need it.

Final Thoughts

The most expensive engineering mistake I've made - and watched others make - was building something we were confident about instead of building something we understood.

Confidence without measurement is expensive. You ship an optimization that doesn't address the real problem. You debug for an hour in the wrong place. You miss an incident because your alerts are measuring the wrong thing.

Understanding requires information. Information requires instrumentation. The engineers I most respect in production systems are the ones who, when something breaks, can usually tell you exactly what happened, when, and why - not because they're brilliant, but because they built a system that made that kind of visibility possible.

You can't optimize what you can't see. And you can't see what you haven't instrumented. The sequence is fixed: observability first. Everything else follows from there.