Variant Systems

Monitoring & Alerting Vibe Code Cleanup

AI added a /health endpoint that returns 200. That's not monitoring. Let's build observability that actually works.

At Variant Systems, we pair the right technology with the right approach to ship products that work.

Why this combination

  • AI-generated health checks don't verify actual application readiness
  • AI doesn't instrument custom application metrics that reveal real problems
  • No alerting means the first sign of trouble is a customer complaint
  • AI-generated logging is unstructured and useless for debugging production issues

What AI Gets Wrong in Monitoring

AI generates a /health endpoint that returns { status: "ok" }. This tells you the web server process is running. It doesn’t tell you the database connection is alive, the cache is reachable, or background workers are processing jobs. Your load balancer routes traffic to an instance that returns 200 while every actual request fails because the database connection pool is exhausted.

Application metrics are absent. No request rate tracking. No error rate monitoring. No latency percentiles. The application processes 1,000 requests per minute and you have no idea if that’s normal, if errors are increasing, or if response times are degrading. Without metrics, there’s no baseline, and without a baseline, there’s no way to detect anomalies.

Logging is console.log statements with no structure, no correlation IDs, and no consistent format. When something breaks in production, debugging means SSH-ing into a server and grepping through text files hoping to find something useful.
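For contrast, here is a minimal sketch of what structured logging with a correlation ID can look like. It has no dependencies, and every name in it (createLogger, handleCheckout, the example-pay provider, the field names) is hypothetical and for illustration only. A production service would more likely use an established logging library.

```typescript
// Minimal structured logger sketch: one JSON object per line, with bound
// fields (such as a correlation ID) attached to every entry.
type Level = "debug" | "info" | "warn" | "error";
type Fields = Record<string, unknown>;
type Sink = (line: string) => void;

interface Logger {
  log(level: Level, message: string, fields?: Fields): void;
  child(bound: Fields): Logger;
}

function createLogger(sink: Sink = (line) => console.log(line), bound: Fields = {}): Logger {
  return {
    log(level: Level, message: string, fields: Fields = {}): void {
      // Fixed keys are spread last so per-call fields cannot overwrite them.
      const entry = { ...bound, ...fields, ts: new Date().toISOString(), level, message };
      sink(JSON.stringify(entry));
    },
    child(extra: Fields): Logger {
      return createLogger(sink, { ...bound, ...extra });
    },
  };
}

// Hypothetical request handler: bind the correlation ID once per request so
// every line written while handling that request carries it.
function handleCheckout(root: Logger, correlationId: string, userId: string): void {
  const log = root.child({ correlationId });
  log.log("info", "checkout started", { userId });
  log.log("error", "payment provider timeout", { userId, provider: "example-pay", elapsedMs: 5003 });
}
```

Because each line is a self-contained JSON object, a log aggregator can filter on correlationId and return every entry for one failing request, with no need to grep free text across servers.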

Our Monitoring Cleanup Process

We instrument the application with proper metrics. Prometheus client libraries expose request count, error count, and latency histograms by endpoint. Custom metrics track business operations such as signups, purchases, and API calls to third-party services. Dashboards and alerts are then built on top of these metrics.
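To make the bookkeeping concrete, here is a dependency-free sketch of per-endpoint request, error, and latency-histogram tracking, rendered in the Prometheus text exposition format. In a real service a Prometheus client library does this work. The class name, metric names, bucket bounds, and the choice to count only 5xx responses as errors are all assumptions made for illustration.

```typescript
// Upper bounds of the latency buckets, in seconds (illustrative values).
const BUCKET_BOUNDS = [0.05, 0.1, 0.25, 0.5, 1, 2.5];

interface EndpointStats {
  requests: number;
  errors: number;
  durationSum: number;
  buckets: { le: number; count: number }[];
}

class RequestMetrics {
  private readonly stats = new Map<string, EndpointStats>();

  private statsFor(endpoint: string): EndpointStats {
    let s = this.stats.get(endpoint);
    if (s === undefined) {
      s = { requests: 0, errors: 0, durationSum: 0, buckets: BUCKET_BOUNDS.map((le) => ({ le, count: 0 })) };
      this.stats.set(endpoint, s);
    }
    return s;
  }

  // Record one finished request. `endpoint` should be a route template
  // (e.g. "/users/:id"), not a raw path, to keep label cardinality bounded.
  observe(endpoint: string, statusCode: number, durationSeconds: number): void {
    const s = this.statsFor(endpoint);
    s.requests += 1;
    if (statusCode >= 500) s.errors += 1; // assumption: only 5xx counts as an error
    s.durationSum += durationSeconds;
    // Buckets are cumulative: an observation counts in every bucket it fits under.
    for (const b of s.buckets) {
      if (durationSeconds <= b.le) b.count += 1;
    }
  }

  // Render in the Prometheus text exposition format (label escaping omitted).
  render(): string {
    const lines: string[] = [];
    this.stats.forEach((s, endpoint) => {
      const label = `endpoint="${endpoint}"`;
      lines.push(`http_requests_total{${label}} ${s.requests}`);
      lines.push(`http_request_errors_total{${label}} ${s.errors}`);
      for (const b of s.buckets) {
        lines.push(`http_request_duration_seconds_bucket{${label},le="${b.le}"} ${b.count}`);
      }
      lines.push(`http_request_duration_seconds_bucket{${label},le="+Inf"} ${s.requests}`);
      lines.push(`http_request_duration_seconds_sum{${label}} ${s.durationSum}`);
      lines.push(`http_request_duration_seconds_count{${label}} ${s.requests}`);
    });
    return lines.join("\n") + "\n";
  }
}
```

A request middleware would call observe once per completed request, and a /metrics route would return render() for Prometheus to scrape. Rate, error ratio, and latency percentiles are then computed at query time from these raw counters.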

Health checks are rewritten to verify real dependencies. The readiness endpoint checks database connectivity, cache availability, and critical external service reachability. Kubernetes readiness probes use this endpoint so traffic only routes to instances that can actually serve requests. Liveness probes hit a separate, dependency-free endpoint, so Kubernetes restarts deadlocked or stuck processes without restart-looping healthy ones when a shared dependency is down.
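A readiness check along these lines can be sketched as follows. The individual dependency checks are supplied by the caller, since the real database, cache, and external clients depend on your stack. The function names, the 2-second default timeout, and the response shape are assumptions, not a fixed interface.

```typescript
// A check resolves when the dependency is healthy and rejects otherwise.
type Check = () => Promise<void>;

interface NamedCheck { name: string; run: Check; }
interface CheckResult { name: string; ok: boolean; error?: string; }
interface ReadinessReport { status: 200 | 503; checks: CheckResult[]; }

// Reject if a dependency check does not settle within timeoutMs, so a hung
// connection cannot hang the probe itself.
function withTimeout(run: Check, timeoutMs: number): Promise<void> {
  return new Promise<void>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`timed out after ${timeoutMs}ms`)), timeoutMs);
    run().then(
      () => { clearTimeout(timer); resolve(); },
      (err: unknown) => { clearTimeout(timer); reject(err); },
    );
  });
}

// Readiness: run all dependency checks in parallel and report 503 if any fail.
async function checkReadiness(checks: NamedCheck[], timeoutMs = 2000): Promise<ReadinessReport> {
  const results = await Promise.all(
    checks.map(async (c): Promise<CheckResult> => {
      try {
        await withTimeout(c.run, timeoutMs);
        return { name: c.name, ok: true };
      } catch (err) {
        return { name: c.name, ok: false, error: err instanceof Error ? err.message : String(err) };
      }
    }),
  );
  return { status: results.every((r) => r.ok) ? 200 : 503, checks: results };
}

// Liveness: deliberately checks no dependencies. It only proves the process
// can still respond, which is what a restart decision should be based on.
function checkLiveness(): { status: 200 } {
  return { status: 200 };
}
```

The readiness route returns the report's status code and body, so a failing probe also says which dependency failed. Keeping liveness free of dependency checks is a common practice we assume here: a database outage should take instances out of rotation, not restart them.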

Alerting is configured based on symptoms, not causes. We alert on rising error rates and degraded latency, the things users actually experience. We don’t alert on CPU utilization unless it directly causes user-visible problems. Each alert has a clear threshold, a notification channel, and a runbook.
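The shape of a symptom-based alert rule can be sketched like this. In practice the rules usually live in the alerting system's own configuration, not in application code. The thresholds, rule names, channels, and runbook paths below are hypothetical placeholders, and requiring several consecutive breached windows is our assumption for filtering out brief spikes.

```typescript
// Aggregated stats for one evaluation window (for example, one minute).
interface WindowStats { requests: number; errors: number; p95LatencyMs: number; }

interface AlertRule {
  name: string;
  breached: (w: WindowStats) => boolean; // true when the user-visible symptom is present
  channel: string; // hypothetical notification target
  runbook: string; // hypothetical runbook location
}

interface Alert { rule: string; channel: string; runbook: string; }

const rules: AlertRule[] = [
  {
    name: "HighErrorRate",
    // Ignore very low traffic so a single failed request cannot page anyone.
    breached: (w) => w.requests >= 100 && w.errors / w.requests > 0.02,
    channel: "pager",
    runbook: "runbooks/high-error-rate.md",
  },
  {
    name: "SlowResponses",
    breached: (w) => w.p95LatencyMs > 800,
    channel: "chat",
    runbook: "runbooks/slow-responses.md",
  },
];

// Fire only if the symptom held for the last `forWindows` consecutive windows.
function evaluate(rule: AlertRule, windows: WindowStats[], forWindows: number): Alert | null {
  if (forWindows < 1 || windows.length < forWindows) return null;
  const recent = windows.slice(-forWindows);
  if (!recent.every((w) => rule.breached(w))) return null;
  return { rule: rule.name, channel: rule.channel, runbook: rule.runbook };
}
```

Neither rule mentions CPU, memory, or disk. Those belong on dashboards for diagnosis, while the alerts themselves stay tied to what users experience.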

Before and After

Before: No visibility into production. Problems discovered by customer emails. Debugging requires reproducing the issue locally because there’s no production observability. No historical data to identify trends.

After: Dashboards showing real-time system health. Alerts that fire within minutes of problems starting. Structured logs that make debugging a 10-minute investigation instead of a multi-hour mystery. Historical metrics that reveal performance trends and capacity needs. You know how your application is performing before your users tell you.

What We Typically Instrument

Beyond the standard RED metrics (request rate, errors, and duration), we add instrumentation tailored to your application’s domain. For SaaS products, that means tracking active sessions, webhook delivery success rates, and queue depths for background jobs. For API-heavy services, we monitor per-client rate-limit usage, payload sizes, and upstream dependency response times. Each metric is chosen because it answers a question someone will ask during an incident or a capacity planning session. We avoid vanity metrics that look good on a dashboard but never trigger a decision.

What you get

  • Application metrics instrumentation (request rate, error rate, latency)
  • Real health check endpoints that verify database, cache, and dependency connectivity
  • Alerting setup with meaningful thresholds and escalation
  • Grafana dashboards for system health and service-level visibility
  • Structured logging implementation
  • Uptime monitoring from external locations

Ideal for

  • Founders whose only monitoring is manually checking whether the site loads
  • AI-built applications with no visibility into performance or errors
  • Teams that discover outages from customer support tickets
  • Products that need basic observability before they can scale

Ready to build?

Tell us about your project and we'll figure out how we can help.

Get in touch