Omnigraph
Guides

Incident Response

Build an incident management graph where agents correlate alerts, trace impact, and coordinate response — with a full audit trail.

When an incident hits, three things fail simultaneously: context is scattered across dashboards, coordination happens in a chaotic Slack thread, and after the fact nobody can reconstruct what happened when. A graph fixes all three.

The ontology

node Service {
    slug: String @key
    name: String @index
    team: String
    tier: enum(critical, standard, internal)
    runbook_url: String?
}

node Alert {
    slug: String @key
    title: String @index
    body: String @index
    severity: enum(critical, high, medium, low)
    source: enum(datadog, pagerduty, cloudwatch, custom, agent)
    status: enum(firing, acknowledged, resolved)
    fired_at: DateTime
    resolved_at: DateTime?
    embedding: Vector(768)? @index
}

node Incident {
    slug: String @key
    title: String @index
    severity: enum(sev1, sev2, sev3, sev4)
    status: enum(detected, investigating, mitigating, resolved, postmortem)
    started_at: DateTime
    resolved_at: DateTime?
    summary: String?
    root_cause: String?
}

node Action {
    slug: String @key
    description: String @index
    type: enum(investigation, mitigation, communication, rollback, escalation)
    status: enum(proposed, in_progress, completed, skipped)
    assigned_to: String?
    created_at: DateTime
    completed_at: DateTime?
}

node Change {
    slug: String @key
    type: enum(deploy, config_change, feature_flag, infra, database_migration)
    description: String @index
    author: String
    deployed_at: DateTime
    commit_sha: String?
}

// Topology
edge DependsOn: Service -> Service
edge EmittedBy: Alert -> Service @card(1..1)
edge DeployedTo: Change -> Service

// Incident structure
edge CorrelatedTo: Alert -> Incident
edge ActionFor: Action -> Incident
edge CausedBy: Incident -> Change
edge Impacted: Incident -> Service

// History
edge RelatedIncident: Incident -> Incident
edge PreviouslyAffected: Service -> Incident

The service dependency graph (DependsOn) plus the incident structure (CorrelatedTo, Impacted, CausedBy) gives agents everything they need: what's happening, what's affected, what changed, and what to do.
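As a warm-up, a single-hop traversal over DependsOn lists what one service itself depends on. This sketch uses only the query constructs shown in the sections below; direct_dependencies is an illustrative name, not part of the guide:

query direct_dependencies($service: String) {
    match {
        $s: Service { slug: $service }
        $s dependsOn $d
    }
    return { $d.name, $d.tier, $d.team }
}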

The incident lifecycle

1. Alert correlation

Alerts fire from multiple sources. A correlation agent groups them:

query unlinked_critical_alerts() {
    match {
        $a: Alert { severity: "critical", status: "firing" }
        not { $a correlatedTo $_ }
    }
    return { $a.slug, $a.title, $a.source, $a.fired_at }
    order { $a.fired_at desc }
}
[
  {
    "slug": "alert-latency-api-2026033014",
    "title": "API p99 latency > 2s",
    "source": "datadog",
    "fired_at": "2026-03-30T14:02:00"
  },
  {
    "slug": "alert-errors-payments-2026033014",
    "title": "Payment service error rate > 5%",
    "source": "datadog",
    "fired_at": "2026-03-30T14:03:00"
  }
]

The agent sees two critical alerts within a minute of each other. It checks whether the affected services are connected:

query service_dependency($service_a: String, $service_b: String) {
    match {
        $a: Service { slug: $service_a }
        $b: Service { slug: $service_b }
        $a dependsOn {1, 3} $b
    }
    return { $a.name, $b.name }
}

If api-gateway depends on payment-service, these alerts are correlated. The agent creates an incident and links both alerts.
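The write side isn't shown in this guide. Assuming a mutation syntax that mirrors the query syntax, the creation step might look like the following sketch — mutate, create, and link are hypothetical keywords, and the field values echo the alert results above:

// Hypothetical mutation syntax — adapt to your Omnigraph version
mutate {
    $i = create Incident {
        slug: "inc-2026033014",
        title: "Payment path degradation",
        severity: "sev1",
        status: "detected",
        started_at: "2026-03-30T14:02:00"
    }
    link Alert { slug: "alert-latency-api-2026033014" } correlatedTo $i
    link Alert { slug: "alert-errors-payments-2026033014" } correlatedTo $i
}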

2. Impact analysis

What's in the blast radius?

query blast_radius($service: String) {
    match {
        $s: Service { slug: $service }
        $upstream dependsOn {1, 3} $s
    }
    return {
        $upstream.name, $upstream.tier, $upstream.team
    }
}
[
  { "name": "API Gateway", "tier": "critical", "team": "platform" },
  { "name": "Checkout Flow", "tier": "critical", "team": "commerce" },
  { "name": "Mobile App", "tier": "critical", "team": "mobile" },
  { "name": "Admin Dashboard", "tier": "standard", "team": "internal-tools" }
]

Up to three hops upstream from the failing service. The agent now knows which teams to notify and which services to monitor.
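When paging, the triage agent can narrow the same traversal to critical-tier services only. A sketch reusing the blast-radius pattern with an inline tier constraint (critical_blast_radius is an illustrative name):

query critical_blast_radius($service: String) {
    match {
        $s: Service { slug: $service }
        $upstream: Service { tier: "critical" }
        $upstream dependsOn {1, 3} $s
    }
    return { $upstream.name, $upstream.team }
}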

3. Recent change detection

What changed right before the incident?

query recent_changes($service: String) {
    match {
        $s: Service { slug: $service }
        // assumes a DeployedTo: Change -> Service edge in the ontology
        $c: Change
        $c deployedTo $s
        $c.type != "feature_flag"
    }
    return {
        $c.description, $c.type, $c.author,
        $c.deployed_at, $c.commit_sha
    }
    order { $c.deployed_at desc }
    limit 10
}
[
  {
    "description": "Upgrade payment processor SDK to v4.2",
    "type": "deploy",
    "author": "deploy-bot",
    "deployed_at": "2026-03-30T13:45:00",
    "commit_sha": "a1b2c3d"
  }
]

A deploy 17 minutes before the first alert. The agent creates a CausedBy edge linking the incident to this change.

4. Historical pattern matching

Has this service failed before in a similar way?

query similar_past_incidents($alert_embedding: Vector(768), $service: String) {
    match {
        $s: Service { slug: $service }
        $s previouslyAffected $past
        $past.status == "postmortem"
        $past_alert correlatedTo $past
    }
    return {
        $past.title, $past.root_cause, $past.summary,
        $past.started_at, $past.resolved_at
    }
    order { nearest($past_alert.embedding, $alert_embedding) }
    limit 3
}

Find past incidents on this service where the alert signatures were semantically similar. Surface the root cause and resolution from the postmortem. The responding engineer gets context in seconds, not after an hour of Slack archaeology.
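A broader variant drops the service constraint and searches all postmortem-stage incidents by alert embedding similarity — useful when the same failure pattern last appeared on a different service. A sketch that assumes every correlated alert carries an embedding:

query similar_alerts_anywhere($embedding: Vector(768)) {
    match {
        $a: Alert
        $a correlatedTo $i
        $i.status == "postmortem"
    }
    return { $i.title, $i.root_cause, $a.title }
    order { nearest($a.embedding, $embedding) }
    limit 5
}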

5. Action tracking

Every response action is a node in the graph:

query incident_actions($incident: String) {
    match {
        $i: Incident { slug: $incident }
        $a actionFor $i
    }
    return {
        $a.description, $a.type, $a.status,
        $a.assigned_to, $a.created_at
    }
    order { $a.created_at asc }
}
[
  {
    "description": "Notify commerce team of checkout degradation",
    "type": "communication",
    "status": "completed",
    "assigned_to": "incident-bot",
    "created_at": "2026-03-30T14:08:00"
  },
  {
    "description": "Rollback payment SDK to v4.1",
    "type": "rollback",
    "status": "in_progress",
    "assigned_to": "oncall-platform",
    "created_at": "2026-03-30T14:10:00"
  }
]

The full incident timeline — every action, who did it, when, what status — is in the graph. The postmortem writes itself.

Multi-agent response

Correlation agent  →  groups alerts into incidents
                      (branch: correlate/20260330-1402)

Triage agent       →  runs blast radius, sets severity, pages teams
                      (branch: triage/inc-2026033014)

Investigation      →  finds recent changes, similar past incidents
agent                 (branch: investigate/inc-2026033014)

Action agent       →  proposes mitigation actions, tracks completion
                      (branch: respond/inc-2026033014)

All branches merge to main. The incident graph grows with every response. After resolution, a postmortem agent traverses the full incident subgraph — alerts, actions, changes, timeline — and drafts the postmortem document.
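The postmortem agent's first traversal can be as simple as pulling an incident's correlated alerts in firing order — the same pattern as incident_actions, applied to the CorrelatedTo edge (incident_alerts is an illustrative name):

query incident_alerts($incident: String) {
    match {
        $i: Incident { slug: $incident }
        $a correlatedTo $i
    }
    return { $a.title, $a.severity, $a.fired_at, $a.resolved_at }
    order { $a.fired_at asc }
}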

What you can build from here

  • Reliability scoring — Traverse PreviouslyAffected edges per service. Count incidents by severity over time windows. Surface services with degrading reliability trends.
  • Dependency risk mapping — Find services that are upstream of many critical paths but have low reliability scores. These are your single points of failure.
  • Runbook automation — When an incident matches a past pattern, auto-propose the actions that resolved it last time.
  • MTTR analysis — Query incident timelines: started_at to resolved_at. Segment by team, severity, and root cause type.
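For the reliability-scoring and MTTR ideas, a per-service incident history query yields the raw rows; since no aggregation syntax appears in this guide, counting by severity and computing durations are assumed to happen in the agent (incident_history is an illustrative name):

query incident_history($service: String) {
    match {
        $s: Service { slug: $service }
        $s previouslyAffected $i
    }
    return { $i.slug, $i.severity, $i.started_at, $i.resolved_at }
    order { $i.started_at desc }
}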
