The Initial Magic of Monitoring
For any engineer teaching themselves modern operations, the Prometheus and Grafana stack feels like a superpower. You install an exporter, point Prometheus at it, and suddenly you have a time-series database full of rich data about your application. You connect it to Grafana, drag and drop a few panels, and voilà: a slick, real-time dashboard tracking CPU usage, request latency, or error rates. This initial success is intoxicating. It’s the “hello, world” moment of observability. You’ve gone from flying blind to having a fully instrumented cockpit. The natural next step is to think, “What else can I track?” You see the power of Prometheus’s label system, which lets you slice and dice your data. And that’s where the trap is set.
The All-You-Can-Eat Label Buffet
Prometheus metrics are defined by a name and a set of key-value pairs called labels. For example, a metric for HTTP requests might look like `http_requests_total{method="POST", handler="/api/users"}`. This is incredibly powerful. You can query the total count, or filter by method, or group by handler. The mistake many self-taught engineers make is treating labels as a free-for-all logging system. They think, “More data is always better!” So they start adding labels for everything: `user_id`, `request_id`, `session_token`, `client_ip`. On the surface, it seems brilliant. You could theoretically track the exact activity of a single user. But in doing so, you’ve just unknowingly pointed a firehose at your database and are about to flood your entire system. This well-intentioned decision is the root of the problem, and it all comes down to a single concept: cardinality.
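To make the "total, filter, group" idea concrete, here is a minimal sketch in plain Python that models each time series as a label set plus a counter value. It is not how Prometheus works internally, and the sample data and the helper names (`total`, `filter_by`, `group_by`) are made up for illustration; the docstrings name the PromQL expressions each helper loosely mirrors.

```python
from collections import defaultdict

# Hypothetical sample: each entry is one time series (label set, current counter value).
SERIES = [
    ({"method": "POST", "handler": "/api/users"}, 120),
    ({"method": "GET", "handler": "/api/users"}, 300),
    ({"method": "GET", "handler": "/api/orders"}, 80),
]

def total(series):
    """Loosely mirrors sum(http_requests_total)."""
    return sum(value for _, value in series)

def filter_by(series, **matchers):
    """Keep only the series whose labels equal every given matcher."""
    return [
        (labels, value)
        for labels, value in series
        if all(labels.get(key) == wanted for key, wanted in matchers.items())
    ]

def group_by(series, label):
    """Loosely mirrors sum by (label) (http_requests_total)."""
    sums = defaultdict(int)
    for labels, value in series:
        sums[labels.get(label, "")] += value
    return dict(sums)

print(total(SERIES))
print(total(filter_by(SERIES, method="GET")))
print(group_by(SERIES, "handler"))
```

Notice that every operation walks the full list of series. With three series that costs nothing, which is exactly why the label count matters later.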
The Real Meaning of Cardinality
This is the hidden detail: cardinality. In the context of Prometheus, cardinality refers to the number of unique time series a metric generates. A time series is a unique combination of a metric name and its label pairs. Let’s use an analogy. Imagine you sell t-shirts. Your metric is `tshirts_sold`. You add two labels: `color` (red, blue, green) and `size` (S, M, L). The number of unique time series is 3 colors * 3 sizes = 9. This is low cardinality. Your database can easily handle tracking nine distinct series. Now, imagine you add a label for `customer_id`. If you have 100,000 customers, you have just created at least 100,000 unique time series for your `tshirts_sold` metric, and up to 100,000 * 9 = 900,000 if customers buy across every color and size. This is high cardinality. If you also added `order_id`, the number would explode into the millions. Each of these unique series must be stored, indexed, and processed by Prometheus. You’re no longer tracking broad trends; you’re tracking an effectively unbounded stream of individual events. Prometheus is not designed for that. That’s a job for logging or tracing systems.
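The t-shirt arithmetic can be checked with a few lines of Python. This is a sketch of the worst case only: it multiplies the number of distinct values per label, so it gives an upper bound, and real series counts depend on which combinations actually occur. The `max_series` helper is a hypothetical name, not part of any Prometheus tooling.

```python
from math import prod

def max_series(label_values):
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(len(values) for values in label_values.values())

tshirt_labels = {
    "color": ["red", "blue", "green"],
    "size": ["S", "M", "L"],
}
low = max_series(tshirt_labels)

# Assumption from the analogy: 100,000 distinct customers.
with_customers = dict(tshirt_labels, customer_id=range(100_000))
high = max_series(with_customers)

print(low)   # 9
print(high)  # 900000
```

One extra label turned nine series into a potential 900,000, because label cardinalities multiply rather than add.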
How High Cardinality Breaks Everything
The consequences of high cardinality are severe and often misdiagnosed. Engineers will blame Grafana for being slow or assume their server needs more RAM, when the real issue is the data structure. First, Prometheus's memory and CPU usage will skyrocket. It has to keep all those unique time series in memory for recent data, leading to performance degradation or crashes (OOM kills). Ingesting new data slows down, and the whole system becomes brittle. Second, your Grafana dashboards become useless. When you run a query like `sum(http_requests_total)`, Prometheus has to churn through millions of series just to add them up. Queries that once took milliseconds now take minutes or time out completely. Your beautiful, responsive dashboard now shows a spinning loader and a “Query timed out” error. Alerting based on these metrics also becomes unreliable, defeating the entire purpose of your monitoring setup.
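Before blaming Grafana or buying more RAM, it helps to see which metric is producing the series. The sketch below counts sample lines per metric name in a scrape written in the Prometheus text exposition format. It is a deliberately simplified parser (it only splits off the metric name and ignores timestamps, escaping, and other details of the format), and both the `SAMPLE` payload and the `series_per_metric` helper are hypothetical.

```python
from collections import Counter

def series_per_metric(exposition_text):
    """Count sample lines per metric name in text-format scrape output."""
    counts = Counter()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The metric name ends at the first '{' (labels) or space (value).
        name = line.split("{", 1)[0].split(" ", 1)[0]
        counts[name] += 1
    return counts

# Hypothetical scrape payload with a user_id label on one metric.
SAMPLE = """\
# TYPE http_requests_total counter
http_requests_total{method="GET",user_id="1"} 3
http_requests_total{method="GET",user_id="2"} 1
http_requests_total{method="POST",user_id="3"} 7
process_cpu_seconds_total 12.5
"""

counts = series_per_metric(SAMPLE)
worst = counts.most_common(1)[0]
print(worst)
```

Run against a real exporter's output, a metric whose count grows with traffic or user numbers is the likely culprit.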
The Right Way to Use Labels
The fix is conceptual, not technical. You must shift your thinking about what labels are for. Labels should represent a small, finite number of dimensions you want to aggregate, group, or filter by. They are for partitioning your system, not identifying unique events. Here’s a simple rule of thumb: if a label’s value can be almost anything (like a user ID, a timestamp, or an email address), it does not belong as a Prometheus label. Stick to things that have a limited set of possible values.

Good labels include:

- `method`: (GET, POST, PUT)
- `status_code`: (200, 404, 503)
- `environment`: (production, staging)
- `region`: (us-east-1, eu-west-2)

Bad labels include:

- `user_id`: (123, 456, 789, ...)
- `request_id`: (a-unique-guid)
- `ip_address`: (any IP in the world)

If you need to associate a metric with high-cardinality data like a user ID, that information belongs in your logs or a distributed tracing system, which are designed to handle it. You can then correlate timestamps between your metrics and your logs when you need to investigate a specific event.
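One way to enforce this rule in application code is an allow-list: only labels known to have a small, fixed set of values reach the metric, and everything else is routed to the log line. The following is a minimal sketch under that assumption; `ALLOWED_LABELS`, `split_labels`, and the sample request context are hypothetical names, not part of any Prometheus client library.

```python
# Hypothetical allow-list: labels with a small, fixed set of possible values.
ALLOWED_LABELS = {"method", "status_code", "environment", "region"}

def split_labels(request_context):
    """Return (metric labels, log fields) for one request's context."""
    labels = {k: v for k, v in request_context.items() if k in ALLOWED_LABELS}
    log_fields = {k: v for k, v in request_context.items() if k not in ALLOWED_LABELS}
    return labels, log_fields

context = {
    "method": "POST",
    "status_code": "503",
    "region": "us-east-1",
    "user_id": "123",
    "request_id": "a-unique-guid",
}
labels, log_fields = split_labels(context)

print(labels)      # safe to attach to a metric
print(log_fields)  # belongs in logs or traces
```

The allow-list makes the safe path the default: a new field added to the request context stays out of the metric until someone deliberately decides it is bounded.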