1. Mistaking Metrics for Observability
The classic dashboard full of pleasing green graphs is the original sin of monitoring. Relying solely on aggregated metrics, like average latency or CPU usage, is like trying to understand a city by only looking at its total power consumption. When a problem arises, these averages hide the outliers where the real fire is burning. A single user might be experiencing 10-second load times, but the average latency across millions of users remains a healthy 200ms. A true observability practice doesn't just collect metrics; it captures high-fidelity events and traces. This allows you to go from seeing a system-wide average to isolating the one specific API call for the one specific customer in the one specific region that’s failing. For a system as complex as Gemini 3, averages are statistical lies that obscure the truth.
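To make the point concrete, here is a minimal sketch in Python of how an average (and even a P99) can look healthy while one request is on fire. The event shape, the `user_id` and `region` values, and the 5-second threshold are all made-up assumptions for illustration, not data from any real system.

```python
import statistics

# Hypothetical request events: 999 fast requests and one very slow one.
events = [
    {"user_id": f"user-{i:04d}", "region": "us-east", "latency_ms": 200.0}
    for i in range(999)
]
events.append({"user_id": "user-0999", "region": "eu-west", "latency_ms": 10_000.0})


def percentile(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # ceiling of pct * n / 100
    return ordered[rank - 1]


latencies = [e["latency_ms"] for e in events]
mean_ms = statistics.mean(latencies)  # the aggregate a dashboard would show
p99_ms = percentile(latencies, 99)    # a tail metric can still miss a lone outlier

# Because the raw events were kept, the failing request can be isolated directly.
slow_events = [e for e in events if e["latency_ms"] > 5_000]

print(f"mean={mean_ms:.1f}ms p99={p99_ms:.1f}ms slow={slow_events}")
```

The aggregate numbers only say that something is roughly fine; the retained event says who was affected and where.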
2. Fearing High-Cardinality Data
Cardinality refers to the number of unique values a piece of data can have. Old-school monitoring systems choked on high-cardinality dimensions like `user_id`, `request_id`, or `session_id`, leading engineers to avoid them. This is a catastrophic mistake in the modern era. Debugging a distributed system without the ability to filter and group by these unique identifiers is nearly impossible. Imagine a user reports a bug. Without the ability to search for their specific `user_id` across all logs, metrics, and traces, you’re flying blind. You can't see their journey through the system or pinpoint where it went wrong. A core tenet of observability is embracing high cardinality. It’s the difference between asking, “What’s our error rate?” and asking, “Show me all the errors experienced by users on the new beta feature flag in the last hour.” One is monitoring; the other is debugging.
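The second question above can be sketched as a filter and group-by over raw events. This is an illustrative in-memory version in Python; the field names (`flag`, `status`, `ts`), the flag value `new_beta`, and the sample data are assumptions, and a real backend would run the equivalent query over indexed event storage.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

# Fixed "now" so the example is reproducible; a real query would use the current time.
NOW = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)

# Hypothetical events carrying high-cardinality fields such as user_id.
events = [
    {"ts": NOW - timedelta(minutes=5), "user_id": "u-17", "flag": "new_beta", "status": 500},
    {"ts": NOW - timedelta(minutes=20), "user_id": "u-17", "flag": "new_beta", "status": 500},
    {"ts": NOW - timedelta(minutes=30), "user_id": "u-42", "flag": "new_beta", "status": 200},
    {"ts": NOW - timedelta(minutes=10), "user_id": "u-99", "flag": "control", "status": 500},
    {"ts": NOW - timedelta(hours=3), "user_id": "u-42", "flag": "new_beta", "status": 503},
]


def errors_for_flag(all_events, flag, since):
    """Return server errors (status >= 500) for one feature flag since a cutoff time."""
    return [
        e for e in all_events
        if e["flag"] == flag and e["status"] >= 500 and e["ts"] >= since
    ]


recent_errors = errors_for_flag(events, "new_beta", NOW - timedelta(hours=1))
errors_by_user = Counter(e["user_id"] for e in recent_errors)

print(errors_by_user)
```

None of this is possible if `user_id` and the feature flag were dropped at ingestion time to keep cardinality low.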
3. Bolting on Tracing as an Afterthought
In a monolithic application, a stack trace could often tell you the whole story. In a microservices-based system like Gemini 3, a single user request might fan out across dozens of services. Without distributed tracing, reconstructing the path and timing of that request is guesswork. Many teams treat tracing as an optional, advanced feature to be added later. But by then, it’s often too late. Context propagation (the digital breadcrumbs that link one service call to the next) needs to be baked into the core of your application framework from day one. When tracing is an afterthought, it’s often incomplete, inconsistent, and untrustworthy. When it’s a first-class citizen, it becomes the backbone of your debugging process, showing you the exact sequence of events that led to a failure, no matter how deep in the call graph it occurred.
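As a rough illustration of what context propagation involves, the sketch below passes a trace ID from service to service in a `traceparent` header, using the version-traceid-parentid-flags layout from the W3C Trace Context format. It is hand-rolled with the Python standard library purely for illustration; the service names and the `start_span` and `outgoing_headers` helpers are hypothetical, and in practice a tracing SDK would handle this for you.

```python
import secrets


def start_span(service, incoming_headers):
    """Continue the caller's trace if a traceparent header exists, else start a new one."""
    header = incoming_headers.get("traceparent")
    if header:
        _version, trace_id, parent_span_id, _flags = header.split("-")
    else:
        trace_id, parent_span_id = secrets.token_hex(16), None  # 32 hex chars
    return {
        "service": service,
        "trace_id": trace_id,
        "span_id": secrets.token_hex(8),  # 16 hex chars
        "parent_span_id": parent_span_id,
    }


def outgoing_headers(span):
    """Build the header that the next service in the chain will receive."""
    return {"traceparent": f"00-{span['trace_id']}-{span['span_id']}-01"}


# Simulate one request hopping through three hypothetical services.
spans = []
headers = {}
for service in ["frontend", "checkout", "inventory"]:
    span = start_span(service, headers)
    spans.append(span)
    headers = outgoing_headers(span)

for s in spans:
    print(s["service"], s["trace_id"], s["parent_span_id"], "->", s["span_id"])
```

If any one service in the chain forgets to forward the header, the trace splits in two at that hop, which is why retrofitted tracing tends to be incomplete.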
4. Siloing the 'Three Pillars'
Engineers love to talk about the “three pillars of observability”: metrics, logs, and traces. The mistake isn't in the concept, but in the execution. Too many organizations maintain them in separate, disconnected systems. The metrics are in Prometheus, the logs are in Splunk, and the traces are in Jaeger. When an alert fires from the metrics system, the engineer has to manually correlate that timestamp across two other tools, hoping to find a relevant log line or a trace. This friction kills debugging speed. A modern observability platform unites these data types. From a spike in a metric graph, you should be able to click through to see the traces that contributed to that spike, and from a specific span in a trace, you should be able to see the exact logs emitted by that service at that moment. Without that seamless correlation, you don't have a single source of truth; you have three confusing sources of clues.
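The click-through described above comes down to shared identifiers across the three data types. Here is a toy Python sketch of that join, assuming metric points carry example trace IDs and that logs are stamped with `trace_id` and `span_id`. All of the records, field names, and the 1000 ms threshold are invented for illustration and do not reflect any particular vendor's data model.

```python
# Hypothetical metric points, each carrying example trace IDs from that time bucket.
metric_points = [
    {"minute": "12:00", "p99_ms": 180, "exemplar_trace_ids": ["t-1"]},
    {"minute": "12:01", "p99_ms": 2400, "exemplar_trace_ids": ["t-2"]},
]

# Hypothetical spans and logs that share trace_id and span_id fields.
spans = [
    {"trace_id": "t-1", "span_id": "s-1", "service": "checkout", "duration_ms": 150},
    {"trace_id": "t-2", "span_id": "s-2", "service": "checkout", "duration_ms": 90},
    {"trace_id": "t-2", "span_id": "s-3", "service": "inventory", "duration_ms": 2300},
]
logs = [
    {"trace_id": "t-1", "span_id": "s-1", "message": "order accepted"},
    {"trace_id": "t-2", "span_id": "s-3", "message": "lock wait timeout on stock table"},
]


def spans_behind_spike(threshold_ms):
    """Metric -> traces: collect spans from traces attached to points above the threshold."""
    trace_ids = {
        trace_id
        for point in metric_points
        if point["p99_ms"] > threshold_ms
        for trace_id in point["exemplar_trace_ids"]
    }
    return [s for s in spans if s["trace_id"] in trace_ids]


def logs_for_span(span):
    """Trace -> logs: find log lines emitted within one specific span."""
    key = (span["trace_id"], span["span_id"])
    return [entry for entry in logs if (entry["trace_id"], entry["span_id"]) == key]


suspects = spans_behind_spike(1000)
slowest = max(suspects, key=lambda s: s["duration_ms"])
slowest_logs = logs_for_span(slowest)

print(slowest["service"], [entry["message"] for entry in slowest_logs])
```

When the three stores do not share these identifiers, the same investigation falls back to matching timestamps by hand.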
5. Disconnecting Technical Data from Business Impact
Your observability platform shows that database query P99 latency is up 15%. So what? Does that matter? The ultimate failure is building a system that can tell you everything about its technical performance but nothing about the user experience or business outcomes. The most advanced observability practices connect technical events to business key performance indicators (KPIs). Instead of just alerting on `cpu_usage > 90%`, they alert when `shopping_cart_abandonment_rate > 5%` and then use technical data to find the root cause. This requires instrumenting your code to capture business context. When a checkout fails, you shouldn't just log a generic error; you should log an event with the `order_value`, `customer_tier`, and other business-relevant data. This allows you to prioritize problems based on actual impact, not just technical noise.
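One way to picture this is a checkout failure emitted as a structured event with business fields attached, then ranked by the order value at stake rather than by raw error count. The sketch below is a Python illustration under assumed names: `checkout_failed_event`, the `error` labels, and the sample orders are hypothetical.

```python
import json
from collections import defaultdict


def checkout_failed_event(order_id, order_value, customer_tier, error):
    """Build a structured event that carries business context alongside the error."""
    return {
        "event": "checkout_failed",
        "order_id": order_id,
        "order_value": order_value,
        "customer_tier": customer_tier,
        "error": error,
    }


# Hypothetical failures: one rare but expensive error, one frequent but cheap one.
failures = [
    checkout_failed_event("o-1", 1200.0, "enterprise", "payment_gateway_timeout"),
    checkout_failed_event("o-2", 15.0, "free", "invalid_coupon"),
    checkout_failed_event("o-3", 20.0, "free", "invalid_coupon"),
    checkout_failed_event("o-4", 10.0, "free", "invalid_coupon"),
]


def value_at_risk_by_error(events):
    """Sum order_value per error type and sort with the most costly error first."""
    totals = defaultdict(float)
    for event in events:
        totals[event["error"]] += event["order_value"]
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


ranked = value_at_risk_by_error(failures)

print(json.dumps(failures[0]))  # what a single emitted event might look like
print(ranked)
```

Counted by occurrences, `invalid_coupon` looks like the bigger problem; weighted by order value, the single gateway timeout is the one to fix first.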













