Variant Systems

Monitoring & Alerting Technical Debt

Your monitoring generates noise, not signal. It's time to fix the alerts nobody trusts and the dashboards nobody reads.

At Variant Systems, we pair the right technology with the right approach to ship products that work.

Why this matters

  • Alert fatigue from hundreds of noisy alerts makes monitoring useless
  • Dashboard sprawl with dozens of unused dashboards hides important information
  • Metrics cardinality explosion drives up monitoring costs
  • Missing SLOs mean nobody knows what 'healthy' actually looks like

Alert Fatigue, Dashboard Entropy, and Metrics Nobody Governs

Alert fatigue is the primary symptom. The team created alerts for every metric that seemed important. Over time, thresholds became outdated, services changed, and alerts that once mattered now fire constantly. Engineers mute channels, ignore pages, and dismiss firing alerts on the assumption that they’re false positives. When a real incident occurs, the signal is lost in the noise.

Dashboard entropy follows. Each incident spawns new dashboards for investigation. Nobody deletes them afterward. The Grafana instance has 80 dashboards, 60 of which haven’t been opened in months. During an incident, engineers create yet another dashboard because they can’t find the existing one that shows what they need.

Metrics grow without governance. New metrics are added but old ones aren’t removed. Label cardinality explodes when someone adds a user ID or request ID as a metric label. Prometheus storage doubles. The monitoring bill grows. Nobody knows which metrics are actually used.
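
To make the explosion concrete: assuming Prometheus and a hypothetical http_requests_total metric, two queries show how much of a metric's series count a single label is responsible for.

  # Total active series behind one metric
  count(http_requests_total)

  # Distinct values of a suspect label; each one multiplies the series count
  count(count by (user_id) (http_requests_total))

If the second number runs into the thousands, that one label is the problem.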

Auditing Every Alert, Consolidating Dashboards, and Defining SLOs

We audit every alert. For each: when did it last fire? Was it actionable? Did someone respond? Alerts that fire weekly without action are deleted. Alerts that fire rarely and require action get runbooks. Informational alerts move to dashboards. The goal: every page means something is broken and the engineer knows what to do.
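
As a sketch of what a surviving alert can look like after the audit, here is a Prometheus alerting rule with a runbook attached; the service name, threshold, and runbook URL are placeholders, not recommendations.

  groups:
    - name: checkout-alerts            # hypothetical service
      rules:
        - alert: CheckoutHighErrorRate
          # Only page when the condition has held long enough to matter
          expr: |
            sum(rate(http_requests_total{job="checkout", code=~"5.."}[5m]))
              /
            sum(rate(http_requests_total{job="checkout"}[5m])) > 0.05
          for: 10m
          labels:
            severity: page
          annotations:
            summary: "Checkout error rate above 5% for 10 minutes"
            runbook_url: "https://wiki.example.com/runbooks/checkout-errors"

Anything that can't be expressed this way, with a clear condition and a clear next step, is a candidate for a dashboard panel rather than a page.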

Dashboards get consolidated into a hierarchy. Level 0: system overview showing whether everything is healthy. Level 1: service dashboards with RED metrics (rate, errors, duration) per service. Level 2: investigation dashboards for deep debugging. Orphaned dashboards for decommissioned services are archived. Every remaining dashboard has an owner.
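
For the Level 1 service dashboards, the RED panels typically reduce to three queries. This sketch assumes Prometheus histograms and the same hypothetical checkout service; your metric names will differ.

  # Rate: requests per second
  sum(rate(http_requests_total{job="checkout"}[5m]))

  # Errors: fraction of requests returning 5xx
  sum(rate(http_requests_total{job="checkout", code=~"5.."}[5m]))
    /
  sum(rate(http_requests_total{job="checkout"}[5m]))

  # Duration: 99th-percentile latency in seconds
  histogram_quantile(0.99,
    sum by (le) (rate(http_request_duration_seconds_bucket{job="checkout"}[5m])))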

SLOs formalize what “healthy” means. Error rates below X%. Latency at the 99th percentile below Y milliseconds. Availability above Z%. Alerting shifts from threshold-based (CPU > 80%) to SLO-based (error budget burning too fast). This fundamentally changes alerting from noisy to meaningful.
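
To illustrate the shift, here is what a burn-rate alert can look like in Prometheus, loosely following the multi-window pattern from the Google SRE workbook. The 99.9% target and 14.4x factor are example numbers, not a recommendation for your SLO.

  groups:
    - name: checkout-slo
      rules:
        - alert: CheckoutErrorBudgetFastBurn
          # A 14.4x burn rate exhausts a 30-day, 99.9% error budget in about two
          # days; requiring both a 1h and a 5m window keeps short blips from paging.
          expr: |
            (
              sum(rate(http_requests_total{job="checkout", code=~"5.."}[1h]))
                / sum(rate(http_requests_total{job="checkout"}[1h]))
            ) > (14.4 * 0.001)
            and
            (
              sum(rate(http_requests_total{job="checkout", code=~"5.."}[5m]))
                / sum(rate(http_requests_total{job="checkout"}[5m]))
            ) > (14.4 * 0.001)
          labels:
            severity: page
          annotations:
            summary: "Checkout is burning its error budget roughly 14x faster than the SLO allows"

A CPU spike that never touches the error budget no longer wakes anyone up.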

Taming Label Explosion, Dropping Orphaned Metrics, and Halving the Bill

Monitoring costs frequently spiral because nobody governs metric cardinality. A single metric with a high-cardinality label — user IDs, request paths with dynamic segments, trace IDs — can generate millions of unique time series. Prometheus scrapes slow down. Grafana Cloud or Datadog bills jump. Storage requirements balloon with no corresponding improvement in observability.
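
A quick way to find the worst offenders, assuming Prometheus: the TSDB stats page in the Prometheus UI lists the highest-cardinality metrics and labels, and a query like the one below does the same ad hoc (note that it is itself expensive on very large instances).

  # Top 10 metric names by number of active series
  topk(10, count by (__name__) ({__name__=~".+"}))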

We perform a cardinality audit across your entire metrics pipeline. Recording rules in Prometheus pre-aggregate high-cardinality metrics into the summaries you actually query. Labels that provide no diagnostic value get dropped at ingestion using relabel_configs or metric transforms. Dynamic path segments in HTTP metrics get normalized — /api/users/12345 becomes /api/users/{id} — reducing cardinality by orders of magnitude while preserving the data you need for debugging.
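
A minimal sketch of both techniques, assuming Prometheus and placeholder metric, label, and job names. Route normalization itself is usually better handled in the instrumentation library or a collector transform, since relabeling two series onto an identical labelset within one scrape causes collisions.

  # --- rules file: pre-aggregate the high-cardinality raw metric ---
  groups:
    - name: checkout-aggregates
      rules:
        - record: job_code:http_requests:rate5m
          expr: sum by (job, code) (rate(http_requests_total[5m]))

  # --- prometheus.yml: drop what you never query before it is stored ---
  scrape_configs:
    - job_name: checkout
      static_configs:
        - targets: ["checkout:9100"]
      metric_relabel_configs:
        # Stop ingesting a debug metric that feeds no dashboard or alert
        - source_labels: [__name__]
          regex: "debug_cache_lookups_total"
          action: drop
        # Drop a label with no diagnostic value (safe only when the remaining
        # labels still uniquely identify every series in the scrape)
        - action: labeldrop
          regex: "deployment_hash"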

For teams on managed platforms like Datadog or Grafana Cloud, we map every custom metric to its consumer. Metrics that feed no dashboard and trigger no alert get removed. Metrics used by a single, rarely accessed dashboard get evaluated for their cost-to-value ratio. The result is typically a 30-50% reduction in billable metric volume with zero loss of operational visibility.

On-Call Engineers Sleep Through the Night Unless Something Genuinely Breaks

On-call becomes sustainable. Alert volume drops 70-80%. Every remaining alert has a runbook and requires action. Engineers trust the system again. Mean time to detection improves because real alerts aren’t buried in noise. On-call engineers sleep through the night unless something genuinely breaks.

What you get

Alert rationalization - reduce alert volume by 70%+
Dashboard consolidation with hierarchical navigation
SLO definition and error budget tracking
Metrics cardinality audit and cleanup
Runbook creation for all actionable alerts
Monitoring cost optimization

Ideal for

  • Teams where on-call engineers ignore most alerts
  • Organizations with Grafana dashboards nobody maintains
  • Companies spending more on monitoring than warranted
  • Teams that want SLO-based alerting instead of threshold-based noise

Ready to build?

Tell us about your project and we'll figure out how we can help.

Get in touch