Monitoring & Alerting Vibe Code Cleanup
AI added a /health endpoint that returns 200. That's not monitoring. Let's build observability that actually works.
At Variant Systems, we pair the right technology with the right approach to ship products that work.
Why this combination
- AI-generated health checks don't verify actual application readiness
- AI doesn't instrument custom application metrics that reveal real problems
- No alerting means the first sign of trouble is a customer complaint
- AI-generated logging is unstructured and useless for debugging production issues
What AI Gets Wrong in Monitoring
AI generates a /health endpoint that returns { status: "ok" }. This tells you the web server process is running. It doesn’t tell you the database connection is alive, the cache is reachable, or background workers are processing jobs. Your load balancer routes traffic to an instance that returns 200 while every actual request fails because the database connection pool is exhausted.
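A readiness check that exercises real dependencies can be sketched in a few lines. This is a minimal illustration, not a drop-in implementation: the `checks` mapping and `cache_check` function are hypothetical stand-ins for whatever dependency probes (a `SELECT 1`, a cache `PING`) your stack uses.

```python
import json

def readiness(checks):
    """Run each dependency check; return an HTTP-style status and body.

    `checks` maps a dependency name to a zero-arg callable that raises
    on failure (e.g. a SELECT 1 against the database).
    """
    results, healthy = {}, True
    for name, check in checks.items():
        try:
            check()
            results[name] = "ok"
        except Exception as exc:  # report the failure, don't crash the probe
            results[name] = f"failed: {exc}"
            healthy = False
    return (200 if healthy else 503), json.dumps(results)

# Hypothetical scenario: the cache is down, so the endpoint returns 503
# and the load balancer stops routing traffic to this instance.
def cache_check():
    raise ConnectionError("redis unreachable")

status, body = readiness({"database": lambda: None, "cache": cache_check})
```

Because the check returns 503 the moment any dependency fails, the instance with the exhausted connection pool is pulled out of rotation instead of silently failing every request.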
Application metrics are absent. No request rate tracking. No error rate monitoring. No latency percentiles. The application processes 1,000 requests per minute and you have no idea if that’s normal, if errors are increasing, or if response times are degrading. Without metrics, there’s no baseline, and without a baseline, there’s no way to detect anomalies.
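The baseline the paragraph above describes can be as simple as counting requests, counting errors, and keeping latency samples per endpoint. A minimal stdlib sketch (real deployments would use a metrics client library instead of an in-process list):

```python
import math

class EndpointMetrics:
    """Minimal in-process RED metrics (rate, errors, duration) for one endpoint."""

    def __init__(self):
        self.requests = 0
        self.errors = 0
        self.latencies_ms = []

    def observe(self, latency_ms, ok=True):
        self.requests += 1
        if not ok:
            self.errors += 1
        self.latencies_ms.append(latency_ms)

    def error_rate(self):
        return self.errors / self.requests if self.requests else 0.0

    def p95_ms(self):
        # Nearest-rank percentile: crude, but enough to surface tail latency.
        data = sorted(self.latencies_ms)
        return data[math.ceil(0.95 * len(data)) - 1]

m = EndpointMetrics()
for ms in (12, 14, 15, 13, 900):   # one slow outlier hiding in the tail
    m.observe(ms)
m.observe(50, ok=False)            # one failed request
```

An average would smooth the 900 ms outlier away; the p95 exposes it, which is exactly why latency percentiles, not averages, are what you alert on.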
Logging is console.log statements with no structure, no correlation IDs, and no consistent format. When something breaks in production, debugging means SSH-ing into a server and grepping through text files hoping to find something useful.
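The fix is one JSON object per log line with a correlation ID attached. A stdlib sketch of the idea (field names here are illustrative, not a standard schema):

```python
import json
import logging
import uuid

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line so production logs can be
    filtered by field instead of grepped as free text."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "correlation_id": getattr(record, "correlation_id", None),
        })

logger = logging.getLogger("app")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# A correlation ID generated at the edge ties together every log line
# produced while serving one request, across services.
cid = str(uuid.uuid4())
logger.info("payment failed", extra={"correlation_id": cid})
```

With this in place, debugging becomes "query every line with this correlation ID" rather than reconstructing a request's path from interleaved free-text output.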
Our Monitoring Cleanup Process
We instrument the application with proper metrics. Prometheus client libraries expose request count, error count, and latency histograms by endpoint. Custom metrics track business operations - signups, purchases, API calls to third-party services. These metrics form the foundation everything else builds on.
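In practice a Prometheus client library (prom-client for Node, prometheus_client for Python) maintains these counters and serves them from a /metrics endpoint. To show what the scraper actually sees, here is a stdlib sketch that renders counters in the Prometheus text exposition format; the metric and label names are illustrative:

```python
def render_prometheus(counters):
    """Render counters in the Prometheus text exposition format -
    the plain-text output a /metrics endpoint serves for scraping."""
    lines = []
    for name, labelled in counters.items():
        lines.append(f"# TYPE {name} counter")
        for labels, value in labelled.items():
            if labels:
                label_str = ",".join(f'{k}="{v}"' for k, v in labels)
                lines.append(f"{name}{{{label_str}}} {value}")
            else:
                lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"

# Request counters broken down by endpoint, plus a business metric.
counters = {
    "http_requests_total": {
        (("method", "GET"), ("path", "/api/orders")): 1042,
        (("method", "POST"), ("path", "/api/orders")): 87,
    },
    "signups_total": {(): 15},
}
output = render_prometheus(counters)
```

Per-endpoint labels are what make the later alerting queries possible: you can ask for the error rate of one route instead of a single number for the whole service.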
Health checks are rewritten to verify real dependencies. The health endpoint checks database connectivity, cache availability, and critical external service reachability. Kubernetes readiness probes use this endpoint so traffic only routes to instances that can actually serve requests. Liveness probes detect deadlocks and restart stuck processes.
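The probe wiring looks roughly like this in a Kubernetes pod spec; paths, port, and timings are illustrative, not taken from a real deployment:

```yaml
containers:
  - name: api
    readinessProbe:        # gate traffic on real dependency checks
      httpGet:
        path: /health/ready
        port: 8080
      periodSeconds: 10
      failureThreshold: 3
    livenessProbe:         # restart the process if it stops responding
      httpGet:
        path: /health/live
        port: 8080
      periodSeconds: 30
      failureThreshold: 3
```

Keeping the two endpoints separate matters: a readiness failure should pull the pod out of rotation, while only a genuinely stuck process should trip the liveness probe and trigger a restart.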
Alerting is configured based on symptoms, not causes. We alert on error rate increases and latency degradation - things users experience. We don’t alert on CPU utilization unless it directly causes user-visible problems. Each alert has a clear threshold, a notification channel, and a runbook.
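As one hedged example, a symptom-based alert in Prometheus rule syntax might look like this; the metric names and threshold are assumptions for illustration, and the runbook path is a placeholder:

```yaml
groups:
  - name: symptoms
    rules:
      - alert: HighErrorRate
        # Fires when more than 5% of requests fail over five minutes -
        # a symptom users feel, not a cause like CPU utilization.
        expr: |
          sum(rate(http_requests_errors_total[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: page
        annotations:
          runbook: runbooks/high-error-rate
```

The `for: 5m` clause is the difference between an actionable page and alert fatigue: the condition must hold for five minutes before anyone is woken up.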
Before and After
Before: No visibility into production. Problems discovered by customer emails. Debugging requires reproducing the issue locally because there’s no production observability. No historical data to identify trends.
After: Dashboards showing real-time system health. Alerts that fire within minutes of problems starting. Structured logs that make debugging a 10-minute investigation instead of a multi-hour mystery. Historical metrics that reveal performance trends and capacity needs. You know how your application is performing before your users tell you.
What We Typically Instrument
Beyond the standard RED metrics (rate, errors, duration), we add instrumentation tailored to your application’s domain. For SaaS products, that means tracking active sessions, webhook delivery success rates, and queue depths for background jobs. For API-heavy services, we monitor per-client rate limits, payload sizes, and upstream dependency response times. Each metric is chosen because it answers a question someone will ask during an incident or a capacity planning session. We avoid vanity metrics that look good on a dashboard but never trigger a decision. Every metric earns its place by being actionable.
What you get
Ideal for
- Founders whose only monitoring is manually checking that the site loads
- AI-built applications with no visibility into performance or errors
- Teams that discover outages from customer support tickets
- Products that need basic observability before they can scale