What does cardinality mean in Prometheus?

"Avoid high cardinality" is a commonly heard Prometheus tip. But what is high cardinality, and what does it mean in the context of Prometheus?

BY Julien Pivotto / ON Nov 29, 2023

What is Cardinality?

Cardinality is a term used in mathematics, particularly in set theory, to denote the number of elements in a set.

In the context of Prometheus monitoring, cardinality refers to the number of unique time series that are being monitored.

A high cardinality metric is one that has a large number of unique time series, while a low cardinality metric is one with fewer unique time series.

If you have provisioned your Prometheus server to handle 6M time series, and you have 500 instances of your application, adding a simple counter would add ~1000 series, which is less than 0.1% of your total capacity. That is pretty cheap, so it is fine.

Now, if you go for a classic histogram with a label that has 10 values, that is 120 time series per instance (every label value produces 10 buckets, a _count, and a _sum). Across 500 instances, that is 60,000 series, or 1% of your Prometheus capacity.
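The arithmetic above can be checked with a short sketch. The function names are illustrative, the numbers (10 buckets, 10 label values, 500 instances, 6M series) are the ones used in this post, and the sketch assumes every instance exposes every label value.

```python
def classic_histogram_series(buckets: int, label_values: int) -> int:
    """Series exposed by one instance: each label value yields
    one series per bucket, plus one _sum and one _count."""
    return label_values * (buckets + 2)


def capacity_share(series_per_instance: int, instances: int, capacity: int) -> float:
    """Fraction of the provisioned series capacity that is consumed."""
    return series_per_instance * instances / capacity


per_instance = classic_histogram_series(buckets=10, label_values=10)
share = capacity_share(per_instance, instances=500, capacity=6_000_000)
print(per_instance)    # 120
print(f"{share:.1%}")  # 1.0%
```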

In this case, it makes sense to wonder whether that is really worth it, or whether you should go for a quantile-less summary instead. The quantile-less summary only produces _sum and _count, which can be a cheap way to notice latency spikes.
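To illustrate why _sum and _count alone can reveal a latency spike, here is a minimal sketch that computes the average request duration between two scrapes. It is the same idea as dividing the rate of _sum by the rate of _count in a query. The function name and the sample values are hypothetical.

```python
def average_latency(prev, curr):
    """Mean request duration between two scrapes.

    prev and curr are (sum_seconds, count) pairs read from the
    _sum and _count series of the same summary.
    """
    delta_sum = curr[0] - prev[0]
    delta_count = curr[1] - prev[1]
    if delta_count <= 0:
        # No new requests in the window, or the counters were reset.
        return None
    return delta_sum / delta_count


# Hypothetical scrapes: 100 requests taking 5 s in total,
# then 100 more requests taking 45 s in total.
calm = average_latency((0.0, 0), (5.0, 100))
spike = average_latency((5.0, 100), (50.0, 200))
print(calm, spike)
```

The average hides the shape of the distribution, so this only signals that something got slower, not which quantile moved.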

Here is an example of a histogram with 10 buckets, producing 12 time series for each value of the handler label.

It comes from the Prometheus server, which currently has 20 different values for handler in our setup:

# HELP prometheus_http_request_duration_seconds Histogram of latencies for HTTP requests.
# TYPE prometheus_http_request_duration_seconds histogram
prometheus_http_request_duration_seconds_bucket{handler="/",le="0.1"} 4
prometheus_http_request_duration_seconds_bucket{handler="/",le="0.2"} 4
prometheus_http_request_duration_seconds_bucket{handler="/",le="0.4"} 4
prometheus_http_request_duration_seconds_bucket{handler="/",le="1"} 4
prometheus_http_request_duration_seconds_bucket{handler="/",le="3"} 4
prometheus_http_request_duration_seconds_bucket{handler="/",le="8"} 4
prometheus_http_request_duration_seconds_bucket{handler="/",le="20"} 4
prometheus_http_request_duration_seconds_bucket{handler="/",le="60"} 4
prometheus_http_request_duration_seconds_bucket{handler="/",le="120"} 4
prometheus_http_request_duration_seconds_bucket{handler="/",le="+Inf"} 4
prometheus_http_request_duration_seconds_sum{handler="/"} 0.00011596600000000001
prometheus_http_request_duration_seconds_count{handler="/"} 4
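As a rough way to verify the count, the sketch below rebuilds the sample lines shown above and counts the series per name. The parser is deliberately simplified (it only skips comments and reads the series name) and is not a full implementation of the exposition format. The figure of 20 handlers comes from the setup described in this post.

```python
from collections import Counter

NAME = "prometheus_http_request_duration_seconds"
BUCKETS = ["0.1", "0.2", "0.4", "1", "3", "8", "20", "60", "120", "+Inf"]

# Rebuild the sample for handler="/" shown in the post.
lines = [f"# TYPE {NAME} histogram"]
lines += [f'{NAME}_bucket{{handler="/",le="{le}"}} 4' for le in BUCKETS]
lines += [
    f'{NAME}_sum{{handler="/"}} 0.00011596600000000001',
    f'{NAME}_count{{handler="/"}} 4',
]


def count_series(exposition: str) -> Counter:
    """Count sample lines per series name, ignoring comments."""
    counts = Counter()
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The series name ends at the first "{" or, without labels, at the first space.
        name = line.split("{", 1)[0].split(" ", 1)[0]
        counts[name] += 1
    return counts


counts = count_series("\n".join(lines))
per_handler = sum(counts.values())
print(per_handler)       # 12
print(per_handler * 20)  # 240 series for 20 handler values
```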

Of course, this assumes you are running 500 instances of your application. You need to use your own judgment and adapt the numbers to your setup. Also, some core features of your application might be worth 1% of your total capacity.

Keep in mind that a label that starts with 10 values can grow to 20 or 30 values within a year as the application evolves. This is something you should consider as well when designing your metrics.
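A small projection makes the effect of label growth concrete. This sketch reuses the assumptions from the histogram example in this post (10 buckets, 500 instances, 6M series of capacity); the function name and defaults are illustrative.

```python
def histogram_capacity_share(label_values, buckets=10, instances=500, capacity=6_000_000):
    """Share of capacity used by a classic histogram with one label."""
    series = label_values * (buckets + 2) * instances
    return series / capacity


# Share of capacity as the label grows from 10 to 30 values.
for values in (10, 20, 30):
    print(values, f"{histogram_capacity_share(values):.1%}")
```

Under these assumptions the cost scales linearly with the number of label values, so tripling the values triples the share.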

What can you do to limit cardinality?

At some point, you have to be ready to work with logs to address this cardinality. Metric labels don’t have to reflect exactly everything. Get the right tooling to jump into the logs to get all the needed details.
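One way to picture this split is sketched below: the metric keeps only a bounded label, while the unbounded details go to the logs. This is an assumption-laden illustration, not a real client library. The `Counter` stands in for a counter metric, and `KNOWN_HANDLERS`, `record_request`, and the `user_id` field are hypothetical names.

```python
import logging
from collections import Counter

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("requests")

# Stand-in for a counter metric, keyed by the bounded label only.
requests_total = Counter()

# Hypothetical fixed set of label values; anything else is grouped as "other".
KNOWN_HANDLERS = {"/", "/api/v1/query", "/metrics"}


def record_request(handler: str, user_id: str, duration: float) -> None:
    label = handler if handler in KNOWN_HANDLERS else "other"
    requests_total[label] += 1
    # Unbounded details (raw path, user) go to the logs, not to labels.
    log.info("request handler=%s user_id=%s duration=%.3fs", handler, user_id, duration)


record_request("/api/v1/query", "user-1", 0.012)
record_request("/api/v1/query", "user-2", 0.340)
record_request("/debug/unknown/123", "user-3", 0.005)
print(dict(requests_total))
```

The number of series stays bounded by the size of the known set plus one, however many distinct users or paths show up.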

To take it to the extreme: I had a customer whose application produced a metric with a cardinality of 125,000 (across multiple instances of the application). That high cardinality metric ended up using 3% of the Prometheus capacity. However, those metrics were the most business-critical ones, and therefore the tradeoff was deemed acceptable.

If you need help keeping your cardinality within bounds, you can get in touch with us.
