Incidents

Incidents collect monitor failures, telemetry SLO breaches, dependency context, responder activity, and post-incident follow-up in one workflow.

Audience: responders, incident commanders, and service owners

Incident lifecycle

  • Open: AImonitoring detects a confirmed failure or reliability breach.
  • Acknowledged: a responder accepts ownership and starts investigation.
  • Notes: responders add timeline context, findings, mitigation steps, and decisions.
  • Resolved: a responder resolves the incident manually, or it closes automatically once recovery conditions are met.
  • Review: teams create a post-incident review with timeline, impact, root cause, and action items.
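
The lifecycle above can be read as a small state machine. The sketch below is illustrative only: `Stage`, `ALLOWED`, and `advance` are hypothetical names, not part of AImonitoring, and the transition rules are assumptions (for example, that an open incident may close automatically without being acknowledged, and that notes are an activity rather than a stage).

```python
from enum import Enum


class Stage(Enum):
    OPEN = "open"
    ACKNOWLEDGED = "acknowledged"
    RESOLVED = "resolved"
    REVIEW = "review"


# Hypothetical transition table; the real product may allow other moves.
ALLOWED = {
    Stage.OPEN: {Stage.ACKNOWLEDGED, Stage.RESOLVED},  # auto-close can skip acknowledgement
    Stage.ACKNOWLEDGED: {Stage.RESOLVED},
    Stage.RESOLVED: {Stage.REVIEW},
    Stage.REVIEW: set(),
}


def advance(current: Stage, target: Stage) -> Stage:
    """Return target if the move is allowed, otherwise raise ValueError."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target
```

Keeping the allowed moves in one table makes it easy to see, and to test, which stage changes a responder or an automatic recovery check could trigger.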

Incident context

  • Affected service, linked monitors, and current service health.
  • Correlation group and related incidents where available.
  • Dependency context for upstream causes and downstream impact.
  • SLO burn alerts and telemetry context when telemetry triggered the incident.
  • Plain-language summaries to speed initial triage.
  • Incident command fields for severity, commander assignment, and communications channel.
  • Recent deployment and GitHub webhook context when a mapped repository changes close to the incident start time.
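
One way to picture the deployment context is a time-window match around the incident start. This is a sketch under stated assumptions, not AImonitoring's actual logic: the `Deployment` record, the `deployments_near` function, the symmetric window, and the 30-minute default are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Deployment:
    # Hypothetical record; field names are assumptions for illustration.
    repository: str
    deployed_at: datetime


def deployments_near(incident_start: datetime,
                     deployments: list[Deployment],
                     window: timedelta = timedelta(minutes=30)) -> list[Deployment]:
    """Return deployments within `window` of incident_start, oldest first."""
    nearby = [d for d in deployments if abs(d.deployed_at - incident_start) <= window]
    return sorted(nearby, key=lambda d: d.deployed_at)


start = datetime(2024, 1, 1, 12, 0)
candidates = deployments_near(start, [
    Deployment("example/api", datetime(2024, 1, 1, 11, 45)),
    Deployment("example/web", datetime(2024, 1, 1, 9, 0)),
])
print([d.repository for d in candidates])
```

Only the `example/api` deployment falls inside the window here; the one from three hours earlier is left out.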

Responder actions

  • Acknowledge an incident to stop secondary escalation logic from treating it as unattended.
  • Add notes for investigation evidence and decisions.
  • Resolve incidents only after the customer-facing impact has ended.
  • Update incident command so responders can see the current commander, severity, and bridge or channel.
  • Create post-incident action items for prevention, detection, response, or communication gaps.
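
The effect of acknowledgement on escalation can be sketched as a simple check. The function name, its parameters, and the 15-minute default are assumptions made for illustration; the actual escalation rules are configured in the product and may differ.

```python
from datetime import datetime, timedelta
from typing import Optional


def should_escalate(opened_at: datetime,
                    acknowledged_at: Optional[datetime],
                    now: datetime,
                    unattended_after: timedelta = timedelta(minutes=15)) -> bool:
    """Return True only when nobody has acknowledged and the timeout has passed."""
    if acknowledged_at is not None:
        return False  # a responder owns the incident, so it is not unattended
    return now - opened_at >= unattended_after
```

In this sketch an acknowledged incident never escalates as unattended, however long it stays open, which is why acknowledging promptly matters.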

AI-assisted reviews

  • Post-incident reviews can generate an AI-assisted draft from timeline, dependency, SLO, deployment, and correlation evidence.
  • Drafts include summary, impact, probable root cause, what went well, what could improve, and suggested action items.
  • If AI is not configured, AImonitoring falls back to deterministic, evidence-based review text so the review workflow remains usable.
  • Responders control publication; AI drafts are never published automatically.
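
The fallback behaviour described above might look roughly like the following. This is a minimal sketch, not the product's implementation: `build_draft`, `deterministic_draft`, the `ai_writer` callable, and the evidence layout (section name mapped to a list of strings) are all hypothetical.

```python
from typing import Callable, Optional

Evidence = dict[str, list[str]]


def deterministic_draft(evidence: Evidence) -> str:
    """Build review text directly from evidence, with no model involved."""
    lines = []
    for section in sorted(evidence):
        lines.append(f"{section.capitalize()}:")
        lines.extend(f"  - {item}" for item in evidence[section])
    return "\n".join(lines)


def build_draft(evidence: Evidence,
                ai_writer: Optional[Callable[[Evidence], str]] = None) -> dict:
    """Prefer the AI writer when configured; otherwise fall back."""
    if ai_writer is not None:
        text, source = ai_writer(evidence), "ai"
    else:
        text, source = deterministic_draft(evidence), "deterministic"
    # Drafts are never published here; a responder has to do that explicitly.
    return {"text": text, "source": source, "published": False}
```

Either path returns an unpublished draft, so the review step works the same way whether or not an AI writer is available.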

Auditability

  • Acknowledgements, notes, manual resolution, review creation, review publishing, and action-item updates are recorded in the audit log.
  • Incident event timelines are kept with the incident record.
  • Alert delivery logs show notification attempts and outcomes.
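
As a rough illustration of what an audit entry for these actions could hold, the sketch below uses an append-only list of immutable records. The `AuditEntry` fields, the action identifiers, and the `record` helper are assumptions; the product's real audit schema is not described here.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Action names mirror the audited events listed above; the identifiers are assumptions.
AUDITED_ACTIONS = frozenset({
    "acknowledgement", "note", "manual_resolution",
    "review_created", "review_published", "action_item_updated",
})


@dataclass(frozen=True)
class AuditEntry:
    incident_id: str
    actor: str
    action: str
    at: datetime


def record(log: list[AuditEntry], incident_id: str, actor: str, action: str) -> AuditEntry:
    """Append an immutable entry; reject actions outside the audited set."""
    if action not in AUDITED_ACTIONS:
        raise ValueError(f"unaudited action: {action}")
    entry = AuditEntry(incident_id, actor, action, datetime.now(timezone.utc))
    log.append(entry)
    return entry
```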
