
See what caused your incident before the war room starts.

When an incident happens, Incidentary tells you where the failure first appeared, when it happened, and how it spread. Ready the moment the alert fires. No guesswork. No archaeology.

Incidentary trace view showing the causal chain, truth cards, and inspector for a synthetic Redis cluster failover incident

The audit

Don't take our word for it.

Especially not in marketing copy. Inspect these three things before you install. Everything else on this page is downstream.

And one more thing we won't do: guess. Incidentary makes no inferences about your incidents. Everything in the artifact is something one of your services actually reported.

The artifact

Open one artifact.
Read four answers.

Every Incidentary artifact ships with the same four-answer header. Not because the answers are easy — because the questions never change.

INC-2444 · checkout-service · 14:22 UTC · partial
where the failure broke

session-service DB_QUERY 500

Redis GET timed out after 1.5s × 3 retries — cluster failover in progress.

where it spread

cdn-edge → checkout-service → session-service → redis

what we don't know

Whether the retry budget amplified load on the failing primary.

1 gap at warehouse-api (outside the critical path)

what to look at next

Inspect Redis cluster state at redis-node-3.prod.internal:6379.

Verify session-service retry policy: 3 attempts, 1.5s each.

We don't tell you why it broke. That's still your job. We just make sure the question starts in the right place.
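
The same header, as data. Here is a minimal sketch of the four-answer record; the field names are illustrative for this page, not Incidentary's actual schema.

from dataclasses import dataclass

@dataclass
class ArtifactHeader:
    # Illustrative shape only; the real artifact schema is not published here.
    incident_id: str                 # "INC-2444"
    service: str                     # "checkout-service"
    opened_at_utc: str               # "14:22"
    impact: str                      # "partial"
    where_it_broke: str              # first failing call, as reported
    where_it_spread: list[str]       # causal chain, upstream to downstream
    what_we_dont_know: list[str]     # gaps the capture could not cover
    what_to_look_at_next: list[str]  # concrete follow-ups, no inference

header = ArtifactHeader(
    incident_id="INC-2444",
    service="checkout-service",
    opened_at_utc="14:22",
    impact="partial",
    where_it_broke="session-service DB_QUERY 500: Redis GET timed out after 1.5s × 3 retries",
    where_it_spread=["cdn-edge", "checkout-service", "session-service", "redis"],
    what_we_dont_know=["Whether the retry budget amplified load on the failing primary"],
    what_to_look_at_next=[
        "Inspect Redis cluster state at redis-node-3.prod.internal:6379",
        "Verify session-service retry policy: 3 attempts, 1.5s each",
    ],
)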

Recent incidents

Five incidents.
Five different first sentences.

When the artifact lands, the first thing every responder reads is one sentence. Incidentary writes it. It says exactly what happened, in the order it happened, in the language your team already uses.

Written by code, not by a model — every word maps to an event your services actually reported. No LLM. No inference. No "we think the issue is..."

  1. sev-1 · checkout-svc · INC-2412 · 14:22 UTC
     checkout-svc called payments, which returned 503 after 1247ms. The error propagated to api-gateway.
  2. sev-3 · orders-api · INC-2287 · 09:08 UTC
     orders-api retried inventory 47 times within 3 seconds before timing out. The retry pattern matched a previous incident (INC-1903).
  3. sev-2 · checkout-svc · INC-2511 · 09:14 UTC
     checkout-svc deployed at 09:14:00 UTC. First confirmed break at 09:14:42, 42 seconds after rollout completed. Reverting the deploy correlated with alert resolution.
  4. sev-2 · api-gateway · INC-2615 · 22:41 UTC
     api-gateway hit a 30-second timeout calling search-svc. search-svc was unreachable from the gateway's region. 6 services visible, 2 gaps in the network path.
  5. sev-1 · users-svc · INC-2701 · 03:17 UTC
     users-svc started returning 500s 4 minutes after a database migration began. The migration locked the users table for read traffic. 3 services affected, all visible.

The same shape, every time. The channel opens to a sentence — not to "wait, which database?"
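
What "written by code, not by a model" could look like in practice: a fixed template per event type, filled only with fields a service actually reported. The event shape and template names below are illustrative, not Incidentary's real pipeline.

# Illustrative only: a deterministic first sentence assembled from reported fields.
# No model, no inference; if a field was never reported, the sentence is never written.
def first_sentence(event: dict) -> str:
    templates = {
        "downstream_error": (
            "{caller} called {callee}, which returned {status} after {latency_ms}ms. "
            "The error propagated to {propagated_to}."
        ),
        "retry_storm": (
            "{caller} retried {callee} {attempts} times within {window_s} seconds "
            "before timing out."
        ),
    }
    return templates[event["kind"]].format(**event)

print(first_sentence({
    "kind": "downstream_error",
    "caller": "checkout-svc",
    "callee": "payments",
    "status": 503,
    "latency_ms": 1247,
    "propagated_to": "api-gateway",
}))
# -> checkout-svc called payments, which returned 503 after 1247ms. The error propagated to api-gateway.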

The mechanism

How it works. (Spoiler: there's no AI involved. On purpose.)

Four steps. Plain mechanics. The trick isn't intelligence; it's timing. Rough sketches of what each step could look like follow the list.

  1. Step 01

    Capture continuously

    The SDK records every outbound call, every error, every slow query — the moment they happen. By default we capture the skeleton: timing, status, causal shape. When the pre-arm signals trip, we elevate to full detail. Events buffer locally and flush in the background. Your services keep running. We keep listening.

    flush every 1s · skeleton (timing + causal shape) by default
  2. Step 02

    Correlate as events arrive

    The correlator builds the causal graph in real time, not on demand. By the time anything goes wrong, the graph already exists. We're not assembling at alert time — we're waiting to be asked.

    causal graph: streaming, not on demand
  3. Step 03

    Lock the window when something looks off

    Anomaly thresholds (latency spikes, error bursts, retry storms) trip the pre-arm sequence. The surrounding causal window locks the moment something looks wrong — so when the alert fires, the lead-up is already preserved.

    pre-arm window: 60s–5min, while signals stay hot
  4. Step 04

    Deliver at the alert

    When PagerDuty (or OpsGenie, or your custom webhook) fires, Incidentary assembles the artifact within seconds. The link lands in Slack with the Truth Cards already populated. You open one URL. The room opens to evidence.

    Incidentary · 14:14:23 UTC

    Pre-arm captured · payment-api p99 +320%

    The 2m36s window before this alert is preserved.

    where
    session-service · DB_QUERY 500
    when
    14:13:47 UTC · T−12s
    what
    Redis cluster failover in progress
    Open the artifact →
    How Incidentary's message appears in Slack the moment an alert fires.
    artifact ready: ≤2s after webhook
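
To make steps 01 and 02 concrete, here is a minimal sketch of a local capture buffer feeding a streaming correlator. The class names, event fields, and flush mechanics are assumptions for illustration, not the actual SDK.

import threading
from collections import defaultdict, deque

class CaptureBuffer:
    # Step 01, illustrative: record skeleton events locally, flush in the background.
    def __init__(self):
        self._events = deque()
        self._lock = threading.Lock()

    def record(self, event: dict) -> None:
        # Skeleton by default: timing, status, and causal shape only.
        with self._lock:
            self._events.append(event)

    def drain(self) -> list:
        with self._lock:
            batch = list(self._events)
            self._events.clear()
        return batch

class StreamingCorrelator:
    # Step 02, illustrative: the causal graph is built as events arrive, not at alert time.
    def __init__(self):
        self.edges = defaultdict(set)  # caller -> set of callees

    def ingest(self, batch: list) -> None:
        for e in batch:
            self.edges[e["caller"]].add(e["callee"])

buffer, graph = CaptureBuffer(), StreamingCorrelator()
buffer.record({"caller": "checkout-svc", "callee": "payments", "status": 503, "ms": 1247})
graph.ingest(buffer.drain())  # in the SDK this would run on a ~1s background timer
print(dict(graph.edges))      # {'checkout-svc': {'payments'}}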
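
Step 03 is thresholds plus a lock on the surrounding window. A rough sketch, with made-up threshold values:

import time

# Illustrative thresholds; the real pre-arm signals and their values are not these.
THRESHOLDS = {
    "p99_latency_increase_pct": 200,  # latency spike
    "errors_per_s": 5,                # error burst
    "retries_per_s": 25,              # retry storm
}

class PreArmWindow:
    # When any signal trips, lock the surrounding causal window (60s to 5min)
    # so the lead-up is already preserved by the time the alert fires.
    def __init__(self, min_s: int = 60, max_s: int = 300):
        self.min_s, self.max_s = min_s, max_s
        self.locked_at = None

    def check(self, metrics: dict) -> None:
        if any(metrics.get(name, 0) >= limit for name, limit in THRESHOLDS.items()):
            self.lock()

    def lock(self) -> None:
        if self.locked_at is None:
            self.locked_at = time.time()  # from here on, capture elevates to full detail

window = PreArmWindow()
window.check({"p99_latency_increase_pct": 320})  # e.g. payment-api p99 +320%
print("pre-arm locked:", window.locked_at is not None)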
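
And step 04, the delivery: the artifact already exists, so answering the webhook is a lookup plus a post to the channel. The endpoint, payload fields, and URLs below are assumptions, not Incidentary's actual integration.

import json
import urllib.request

def on_alert_webhook(alert: dict, slack_webhook_url: str) -> None:
    # Illustrative: no assembly happens here, just lookup and delivery.
    artifact_url = f"https://app.incidentary.example/artifacts/{alert['incident_id']}"
    message = {
        "text": (
            f"Pre-arm captured · {alert['service']} {alert['signal']}\n"
            f"where: {alert['where']}\n"
            f"when: {alert['when']}\n"
            f"what: {alert['what']}\n"
            f"Open the artifact → {artifact_url}"
        )
    }
    request = urllib.request.Request(
        slack_webhook_url,
        data=json.dumps(message).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(request)

# Example call, mirroring the truth cards above (the webhook URL is a placeholder):
# on_alert_webhook(
#     {
#         "incident_id": "INC-2444", "service": "payment-api", "signal": "p99 +320%",
#         "where": "session-service · DB_QUERY 500",
#         "when": "14:13:47 UTC · T−12s",
#         "what": "Redis cluster failover in progress",
#     },
#     slack_webhook_url="https://hooks.slack.com/services/YOUR/WEBHOOK/URL",
# )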

We're not replacing your APM.

You read Incidentary first, then go to Datadog knowing exactly what you're looking for. Think of it as the index for the rest of your stack — the page you read before you start scrolling Loki.

The job

The demo starts in 9 min. The page just fired.

Or it's 2:14am. Or you're three messages behind in standup and your phone has been buzzing for forty seconds. The hour decides who's watching. It doesn't decide what you need from the next thirty seconds.

You need the cause, named, in the language your services already use. You need an artifact you can paste in the channel without a paragraph of context. You need it before the ETA pings start, before the Slack guesses start, before the customer email gets escalated to your VP.

  1. Stop the "wait — did anyone deploy?" message.

  2. Stop the every-five-minutes "any update?" ping from the room.

  3. Stop pasting screenshots from four tools into one war-room thread.

  4. Stop telling Marketing "we don’t have an answer yet" for the third time.

  5. Stop forecasting an ETA you can’t forecast because you don’t know the cause.

  6. Stop reconstructing the timeline from scrollback the next morning.

  7. Stop writing "it appears that…" in the postmortem.

  8. Stop saying "I’m not sure" in the customer-facing email.

  9. Stop the second war room because the first one didn’t reach a conclusion.

  10. Stop the on-call rotation feeling like a tax the senior engineers pay.

The artifact is the alert. The cause is in the first frame. The chain was already assembled by the time you opened the link.

You make the call from evidence. You go back to whatever the page interrupted.

Open a real artifact

incidentary.com/demo · no signup, no throwaway email

The install

Five minutes to your first artifact.
One destination.

Already running OpenTelemetry? Add Incidentary as one more exporter. Don't have OTel yet? Install the SDK on one service. Either door, same artifact.

Add to your existing collector

No package to install. No agent to run. A few lines of YAML in the pipeline you already maintain.

full quickstart guide →
otel-collector.yaml
exporters:
  otlp/incidentary:
    endpoint: api.incidentary.com:4317
    headers:
      authorization: "Bearer ${INCIDENTARY_API_KEY}"

service:
  pipelines:
    traces:
      exporters: [otlp/incidentary]
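
The other door, for a service that doesn't run OTel yet, is an SDK in the service itself. Incidentary's own SDK isn't shown here; as a sketch of the shape, the standard OpenTelemetry Python SDK pointed at the same endpoint looks like this:

import os

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Export spans to the same endpoint the collector config above uses.
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(
        OTLPSpanExporter(
            endpoint="api.incidentary.com:4317",
            headers={"authorization": f"Bearer {os.environ['INCIDENTARY_API_KEY']}"},
        )
    )
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")
with tracer.start_as_current_span("checkout"):
    pass  # your existing request handling runs here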

The next incident is already on the calendar.

Open one. See for yourself. Then decide whether the next one belongs here too.

We'd rather you opened one before you signed up. Signup is for after.