Logging & Tracing Technical Debt
Your logs are unstructured text scattered across servers. Time for logging that actually helps you debug production.
At Variant Systems, we pair the right technology with the right approach to ship products that work.
Why this matters
- Unstructured logs make incident investigation painfully slow
- Missing correlation IDs prevent tracing requests across services
- Log retention policies that don't exist lead to either lost data or excessive costs
- Inconsistent logging across services creates blind spots
Inconsistent Formats, Missing Request IDs, and Retention Gone Wrong
The most common debt is inconsistent logging across services. Each service was built by different engineers at different times. One emits structured JSON. Another writes plain text. A third uses the framework’s default logger with no configuration. Field names differ. Timestamp formats differ. Severity levels differ. Querying across services requires knowing the peculiarities of each.
Missing context is the second pattern. Log messages say what happened but not enough to understand why or for whom. “Order processing failed” - which order? Which user? What error? Engineers who wrote the code can often deduce context from the message. Everyone else is lost. This institutional knowledge dependency means only the original authors can debug production effectively.
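As a sketch of the difference, here is the same failure logged with and without context. The field names and IDs are hypothetical, chosen only to illustrate the pattern:

```python
import json

# Context-free entry: says what happened, but not why or for whom.
bad = {"level": "ERROR", "message": "Order processing failed"}

# Context-rich entry: anyone on the team can pick up the investigation.
good = {
    "level": "ERROR",
    "message": "Order processing failed",
    "order_id": "ord_81432",    # hypothetical IDs for illustration
    "user_id": "usr_2291",
    "error": "payment_gateway_timeout",
    "request_id": "req_f3a9c1",
}

print(json.dumps(good))
```

The second entry answers "which order, which user, what error" directly, with no dependency on whoever wrote the code.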
Retention mismanagement is the third pattern. Either logs are kept forever (expensive) or deleted after a week (insufficient for investigations that start late). No tiered retention where recent logs are in hot storage and older logs are archived cheaply. The organization pays premium rates for logs nobody queries.
Standardizing Structured JSON, Centralizing Collection, and Redacting PII
We standardize logging across all services. A shared logging library or configuration ensures consistent format, field names, and severity levels. Migration happens service by service, typically during other maintenance work. Each migration adds request ID propagation and ensures error paths log sufficient context.
Centralized collection replaces per-server log files. We implement a log pipeline that ships structured logs to a searchable platform - Loki for Grafana-based stacks, Elasticsearch for teams that need full-text search, or managed services for teams that prefer simplicity. Retention policies match actual needs: 30 days in hot storage for active debugging, 90 days in warm storage for investigation, archives for compliance.
PII redaction is implemented at the logging layer. Fields containing email addresses, phone numbers, or other PII are masked before they leave the application. This is cheaper and more reliable than filtering in the log pipeline because it prevents PII from ever being transmitted.
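One way to sketch this with Python's stdlib `logging`: a filter that masks emails and phone numbers before any handler sees the record. The regexes are deliberately simple illustrations, not production-grade PII detection:

```python
import logging
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


class RedactPII(logging.Filter):
    """Mask emails and phone numbers before the record leaves the process."""

    def filter(self, record):
        msg = record.getMessage()
        msg = EMAIL_RE.sub("[email redacted]", msg)
        msg = PHONE_RE.sub("[phone redacted]", msg)
        # Freeze the redacted message so no later formatter re-expands args.
        record.msg, record.args = msg, None
        return True


log = logging.getLogger("support")
log.addFilter(RedactPII())
```

Attaching the filter to the logger (or a shared handler) means redaction happens in-process, so raw PII is never shipped to the log pipeline at all.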
Minutes to Debug Instead of Hours, Plus Log-Based Alerting That Catches Issues Early
Incident investigation time drops dramatically. Engineers search centralized logs with structured queries instead of SSH-ing into servers and grepping text. Request IDs trace individual requests across services. Error context tells the full story. What took hours of detective work now takes minutes of focused querying.
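Request-ID propagation can be sketched with Python's `contextvars`, which makes the ID available to every log call in a request's scope without threading it through function signatures. Names like `handle_request` and the header value are hypothetical:

```python
import contextvars
import logging
import uuid

# Context variable: each request's ID is visible to all log calls in its scope.
request_id_var = contextvars.ContextVar("request_id", default="-")


class RequestIdFilter(logging.Filter):
    """Stamp every record with the current request ID automatically."""

    def filter(self, record):
        record.request_id = request_id_var.get()
        return True


def handle_request(incoming_id=None):
    # Reuse the upstream service's ID if present; otherwise mint one at the edge.
    request_id_var.set(incoming_id or uuid.uuid4().hex[:12])
    log.info("request started")  # request_id attached by the filter


log = logging.getLogger("api")
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(levelname)s %(request_id)s %(message)s"))
handler.addFilter(RequestIdFilter())
log.addHandler(handler)
log.setLevel(logging.INFO)

handle_request(incoming_id="req_7f3a")  # hypothetical upstream header value
```

Forwarding the same ID in outbound calls is what lets a single query pull one request's trail across every service.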
Log-based alerting becomes possible once structure is in place. Instead of waiting for metrics to reflect a problem, you can trigger alerts directly from log patterns: a spike in authentication failures, a sudden increase in 5xx responses from a specific upstream, or the first appearance of an out-of-memory error. Tools like Loki’s LogQL or Elasticsearch’s Watcher make this straightforward when every log entry has consistent fields. These alerts often catch issues five to ten minutes earlier than metric-based detection because the log entry is the first signal, while the metric is a lagging aggregation of many such entries.
What you get
- Consistent structured JSON logging across every service
- Request IDs that trace a single request end to end
- Centralized, searchable logs with tiered retention matched to actual need
- PII redacted before logs ever leave the application
- Log-based alerts that fire minutes ahead of metric-based detection
Ideal for
- Teams debugging production with grep and SSH
- Distributed systems with no request correlation across services
- Organizations with compliance requirements for log handling
- Companies whose logging costs are growing unsustainably