Incidentary Docs


Why Incidentary

The real competition is not Datadog, Grafana, or Honeycomb. It is the four things your team reaches for when a service degrades at 2am. Each of them works in normal conditions. None of them work under incident conditions.


The scenario

A service degrades. The oncall engineer is paged. It is 2:17am.

The service health check is returning 200s. The P99 latency is elevated. Three downstream services are throwing errors. The Slack thread already has six engineers in it.

What do you reach for?


grep production logs

Works until the relevant event is in a rotated log, on a different host, or buried in a stream of high-cardinality noise you don't have a query for yet.

grep is a search tool. Incident investigation is a causal reasoning problem. You can grep for an error string and find it. What you cannot do is grep for why that error happened, which service called which other service three hops upstream, or whether the error was the cause or a symptom.

Under incident conditions, log volume spikes. The thing you're looking for is rarely a unique string. And the log that explains it may be on a host you haven't SSHed into yet.


kubectl logs

A stream of events with no causal ordering, no relationship to the incident timeline, and nothing left after the pod restarts.

kubectl logs gives you a chronological stream from a single pod. What you need is a causal graph across multiple pods, multiple services, and multiple time windows.

The ordering in kubectl logs is wall-clock order within one process. It tells you nothing about which upstream request triggered which downstream call, or how the state at service A propagated to the failure at service B.

And if the pod has restarted — which it has, because Kubernetes restarted it as part of the failure — the logs are gone.


console.log

Works in development. Does not work in production at scale. No structured context, no trace correlation, no pre-arm window.

console.log in production is manual, inconsistent, and unstructured. You added it to debug something last Tuesday. It is not there for the thing that is broken tonight.

Even with structured logging, the output is an append-only stream. It carries no parent-child relationship between events. You cannot reconstruct a causal chain from a flat log file. You can only reconstruct a chronology, and chronology is not causality.
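The distinction is easy to see in code. Below is an illustrative sketch (the event shape and field names are assumptions, not any real log schema): four events, one of them unrelated noise. Sorting by timestamp interleaves the noise into the story; walking parent pointers recovers the actual chain.

```typescript
// Hypothetical event shape for illustration only.
interface CausalEvent {
  id: string;
  parentId: string | null; // the event that triggered this one, if known
  ts: number;              // wall-clock timestamp (ms)
  msg: string;
}

const events: CausalEvent[] = [
  { id: "a", parentId: null, ts: 100, msg: "inbound POST /checkout" },
  { id: "b", parentId: "a",  ts: 105, msg: "call inventory-svc" },
  { id: "c", parentId: "b",  ts: 130, msg: "inventory-svc timeout" },
  { id: "d", parentId: null, ts: 110, msg: "unrelated health check" },
];

// Chronology: sort by timestamp. The unrelated event lands mid-stream.
const chronology = [...events].sort((x, y) => x.ts - y.ts).map(e => e.id);
// chronology is ["a", "b", "d", "c"]

// Causality: walk parent pointers back from the failure.
function causalChain(all: CausalEvent[], failureId: string): string[] {
  const byId = new Map(all.map(e => [e.id, e]));
  const chain: string[] = [];
  let cur = byId.get(failureId);
  while (cur) {
    chain.unshift(cur.id);
    cur = cur.parentId ? byId.get(cur.parentId) : undefined;
  }
  return chain;
}
// causalChain(events, "c") is ["a", "b", "c"]; the health check never appears
```

A flat log only ever gives you the first ordering. The second requires that each event carry a link to its cause, which is exactly what plain logging does not record.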


The Slack thread where six engineers are guessing

The implicit incident management system for most teams. High latency, high noise, no single source of truth. The coordination overhead is the second outage.

Every team that does not have a structured incident trace has this thread. Engineer A is grepping logs. Engineer B is reading metrics. Engineer C is checking the deploy timeline. Engineer D has a theory. Engineers E and F are being paged into the thread to share context they may or may not have.

The coordination overhead is itself a compounding failure. While the thread is filling up with theories, the service is still down.


The problem being solved

By the time any of the above tools yield an answer, the damage is done. You are reconstructing causality from fragments — not reading it.

The hardest part of incident response is not fixing the thing. It is finding the thing. The fix is usually one line. The investigation is usually forty minutes.

Incidentary captures the causal trace before the alert fires. When the page comes in, the trace is already assembled. You are not hunting — you are reading.


What Incidentary does instead

The SDK instruments your service boundary. Every inbound and outbound HTTP call creates a causal event that carries a trace ID and a parent-child relationship to the upstream call that triggered it.
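The propagation mechanism can be sketched as follows. This is a minimal illustration of the general technique, with assumed header names (x-trace-id, x-parent-span) and an assumed context shape; it is not Incidentary's actual wire format or SDK API.

```typescript
// Hypothetical span context for illustration.
interface SpanContext {
  traceId: string;
  spanId: string;
  parentSpanId: string | null;
}

let counter = 0;
const newId = () => `span-${++counter}`; // stand-in for a real ID generator

// On an inbound request: adopt the caller's trace, or start a new one.
function contextFromInbound(headers: Record<string, string>): SpanContext {
  return {
    traceId: headers["x-trace-id"] ?? `trace-${++counter}`,
    spanId: newId(),
    parentSpanId: headers["x-parent-span"] ?? null,
  };
}

// On an outbound call: the current span becomes the downstream parent.
function headersForOutbound(ctx: SpanContext): Record<string, string> {
  return { "x-trace-id": ctx.traceId, "x-parent-span": ctx.spanId };
}

// Service A receives an external request, then calls service B.
const atA = contextFromInbound({});
const atB = contextFromInbound(headersForOutbound(atA));
// atB shares atA's traceId, and atB.parentSpanId === atA.spanId
```

The key property: every hop shares one trace ID, and every span records which upstream span triggered it. That pair of facts is what lets a causal graph be assembled later.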

Events are captured continuously in a pre-arm window — a rolling buffer of trace data collected before any alert threshold is crossed. When an alert fires, the pre-arm window is retained. The trace that explains why the alert fired already exists.
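Conceptually, a pre-arm window behaves like a fixed-capacity ring buffer that overwrites itself until an alert freezes it. The sketch below illustrates that behavior under assumed names and sizing; the real SDK's interface and retention policy may differ.

```typescript
// Illustrative model of a pre-arm window: not the SDK's real interface.
class PreArmWindow<T> {
  private buf: T[] = [];
  private frozen: T[] | null = null;

  constructor(private capacity: number) {}

  record(event: T): void {
    if (this.frozen) return;          // after an alert, stop overwriting
    this.buf.push(event);
    if (this.buf.length > this.capacity) this.buf.shift(); // evict oldest
  }

  // Called when an alert fires: retain what was captured before it.
  onAlert(): T[] {
    this.frozen = [...this.buf];
    return this.frozen;
  }
}

const preArm = new PreArmWindow<string>(3);
["e1", "e2", "e3", "e4"].forEach(e => preArm.record(e));
const retained = preArm.onAlert();
// retained is ["e2", "e3", "e4"]: the oldest event was evicted before the alert
```

The point of the design: capture cost is paid continuously and bounded by the buffer size, so the events that explain the alert are already in hand when the threshold is crossed, rather than being requested after the fact.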

Your oncall engineer opens the trace URL from the alert. They see the causal waterfall: every service, every span, every status code, in order. The question "why did this happen" has an answer before the Slack thread even opens.
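Assembling that waterfall from parent-child spans is a straightforward tree walk. The sketch below shows the general idea with assumed span fields and a plain-text layout; the actual UI renders more than this.

```typescript
// Hypothetical span shape and text rendering, for illustration only.
interface Span {
  id: string;
  parentId: string | null;
  service: string;
  status: number;
  startMs: number;
}

function waterfall(spans: Span[]): string[] {
  // Group spans under their parent, then walk the tree depth-first.
  const children = new Map<string | null, Span[]>();
  for (const s of spans) {
    const list = children.get(s.parentId) ?? [];
    list.push(s);
    children.set(s.parentId, list);
  }
  const lines: string[] = [];
  const walk = (parentId: string | null, depth: number) => {
    const kids = (children.get(parentId) ?? []).sort((a, b) => a.startMs - b.startMs);
    for (const s of kids) {
      lines.push(`${"  ".repeat(depth)}${s.service} [${s.status}]`);
      walk(s.id, depth + 1);
    }
  };
  walk(null, 0);
  return lines;
}

const lines = waterfall([
  { id: "1", parentId: null, service: "api-gateway", status: 200, startMs: 0 },
  { id: "2", parentId: "1",  service: "checkout",    status: 500, startMs: 5 },
  { id: "3", parentId: "2",  service: "inventory",   status: 504, startMs: 9 },
]);
// lines:
//   api-gateway [200]
//     checkout [500]
//       inventory [504]
```

Read top to bottom, the indentation is the causal chain: the gateway returned 200 while its child call failed, which is why a passing health check told the oncall engineer nothing.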

See How It Works for the technical integration details.
