Flipkart engineers recently published a detailed case study describing how they overcame severe scalability limits in their monitoring stack by adopting a hierarchical federation design in Prometheus. The migration was driven by their API Gateway layer, where approximately 2,000 instances each emitted around 40,000 metrics, adding up to roughly 80 million active time series.

Initially, Flipkart used StatsD for metrics aggregation but found that it failed to scale: queries over longer time ranges overwhelmed storage and made historical analysis impractical. To address this, the team transitioned to Prometheus, which better supports high-dimensional queries and integrates well with Kubernetes and the exporter ecosystem.

In the hierarchical design, leaf-level Prometheus servers scrape instances locally and pre-aggregate metrics with recording rules; federated servers then scrape selected aggregated metrics upward, writing them to long-term storage and dashboards. This tiered design dramatically reduces metric cardinality and the load on central servers. To cut cardinality further, Flipkart dropped the high-cardinality instance label while retaining stable dimensions such as service or cluster, and for latency metrics (p95, p99) published summary statistics (average, max, min) rather than per-instance series. This approach collapsed 80 million raw series into tens of thousands of cluster-level metrics.
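As an illustrative sketch (not Flipkart's published configuration; the metric names, job names, and targets below are hypothetical), a leaf-level Prometheus can pre-aggregate with recording rules that drop the instance label:

```yaml
# leaf-rules.yml -- recording rules evaluated on each leaf Prometheus
groups:
  - name: gateway_aggregation
    rules:
      # Collapse per-instance request rates into one series per service/cluster
      - record: job:http_requests:rate5m
        expr: sum without (instance) (rate(http_requests_total[5m]))
      # Publish summary statistics for latency instead of per-instance series
      - record: job:request_latency_seconds:avg
        expr: avg without (instance) (request_latency_seconds)
      - record: job:request_latency_seconds:max
        expr: max without (instance) (request_latency_seconds)
```

On the parent server, a federation scrape can then pull only the aggregated series rather than the raw metrics:

```yaml
# Fragment of prometheus.yml on the parent server
scrape_configs:
  - job_name: "federate"
    honor_labels: true
    metrics_path: "/federate"
    params:
      "match[]":
        - '{__name__=~"job:.*"}'   # federate recording-rule output only, never raw series
    static_configs:
      - targets: ["leaf-prometheus-a:9090", "leaf-prometheus-b:9090"]
```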

The authors also caution about tradeoffs: hierarchical federation is less useful when you need to debug per-instance anomalies across clusters or when working with small deployments where a single Prometheus instance suffices. They advise against blindly mirroring raw metrics upward and instead emphasize filtering and selective federation.

Flipkart’s experience underscores the limits of flat monitoring architectures at scale and illustrates how federation, aggregation rules, and label pruning can make high-scale observability manageable. Their approach offers a practical blueprint for organizations facing explosive metric growth in cloud-native environments.

While Flipkart’s hierarchical federation approach offers simplicity and strong alignment with native Prometheus, other organizations facing similar scaling challenges have turned to distributed systems like Thanos, Cortex/Mimir, or VictoriaMetrics. Each tackles scale and retention in different ways.

Thanos, for example, extends Prometheus with long-term storage and global querying capabilities. Rather than relying on multi-tier federation, Thanos attaches sidecar components to each Prometheus instance, which then upload metrics to object storage (e.g., S3 or GCS). A centralized Querier can query across clusters, offering a unified view without the need to aggregate or drop labels. This eliminates manual federation setup but introduces additional components and operational overhead. Compared to Flipkart’s hierarchical model, Thanos offers stronger global query visibility but requires managing more infrastructure and storage.
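As a rough sketch of that pattern (the bucket name, endpoint, and paths are hypothetical), each sidecar is pointed at its local Prometheus and at an object-storage configuration file such as:

```yaml
# bucket.yml -- object storage configuration read by the Thanos sidecar,
# which runs next to each Prometheus, e.g.:
#   thanos sidecar --prometheus.url=http://localhost:9090 \
#                  --tsdb.path=/prometheus \
#                  --objstore.config-file=bucket.yml
type: S3
config:
  bucket: "metrics-long-term"              # hypothetical bucket name
  endpoint: "s3.us-east-1.amazonaws.com"
  region: "us-east-1"
```

A Thanos Querier then fans out queries across the sidecars and the object-storage data to provide the unified view described above.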

Cortex (and its modern fork, Grafana Mimir) takes a more cloud-native, horizontally scalable approach. Metrics are stored in a multi-tenant distributed time-series database built on microservices like distributors, ingesters, and queriers. Unlike Flipkart’s hierarchical federation, Cortex and Mimir use consistent hashing and object stores to shard data automatically, enabling massive scale and durability with less manual configuration. The trade-off is complexity: deploying and tuning Cortex or Mimir requires expertise in distributed systems, while Flipkart’s approach can be implemented with native Prometheus tools.
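In practice, Prometheus servers typically ship samples to Cortex or Mimir via remote write rather than federation; a minimal sketch (the endpoint and tenant ID are hypothetical) looks like this:

```yaml
# Fragment of prometheus.yml: push samples to a Mimir/Cortex cluster
remote_write:
  - url: "http://mimir-gateway.monitoring.svc:8080/api/v1/push"  # hypothetical in-cluster endpoint
    headers:
      X-Scope-OrgID: "gateway-team"   # tenant ID for the multi-tenant ingest path (hypothetical value)
```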

VictoriaMetrics, on the other hand, takes a middle ground. It supports the Prometheus remote write and read APIs but emphasizes performance and compression efficiency. Its single-binary setup makes it operationally lightweight, yet it can still handle tens of millions of active time series. For teams prioritizing ease of deployment and efficient long-term retention, VictoriaMetrics often provides a simpler alternative to both federation and Thanos-style architectures.
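A comparable sketch for VictoriaMetrics (the hostname and retention value are hypothetical) shows how little configuration the single-binary setup needs:

```yaml
# Fragment of prometheus.yml: push samples to a single-node VictoriaMetrics
# (the server itself runs as one binary, e.g.:
#   ./victoria-metrics-prod -storageDataPath=/var/lib/victoria-metrics -retentionPeriod=12)
remote_write:
  - url: "http://victoriametrics:8428/api/v1/write"   # 8428 is the default VictoriaMetrics port
```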

Flipkart’s choice to rely on hierarchical federation rather than these distributed backends reflects a balance between control, simplicity, and incremental scalability. It keeps each Prometheus server self-contained, avoids external dependencies such as object storage, and preserves native query semantics, which is critical in large, multi-cluster Kubernetes environments.

However, as the organization scales further, hybrid architectures that pair federation with a long-term backend (such as Thanos or VictoriaMetrics) could improve retention, cross-cluster querying, and resilience.
Source: https://www.infoq.com/news/2025/10/flipkart-prometheus-80million/?utm_campaign=infoq_content&utm_source=infoq&utm_medium=feed&utm_term=global
