Incident response

Incidents

Incidents collect monitor failures, telemetry SLO breaches, dependency context, responder activity, and post-incident follow-up in one workflow.

Audience: responders, incident commanders, and service owners

Incident lifecycle

Open: AImonitoring detects a confirmed failure or reliability breach.
Acknowledged: a responder accepts ownership and starts investigation.
Notes: responders add timeline context, findings, mitigation steps, and decisions.
Resolved: a responder resolves the incident manually, or it closes automatically after recovery conditions are met.
Review: teams create a post-incident review with timeline, impact, root cause, and action items.
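
The lifecycle above can be read as a small state machine. The following is a minimal sketch in Python, not AImonitoring's implementation: the IncidentState enum, the ALLOWED table, and the transition function are hypothetical names. It assumes that notes are timeline entries rather than a state change, and that an open incident may move straight to resolved when it closes automatically.

```python
from enum import Enum


class IncidentState(Enum):
    OPEN = "open"
    ACKNOWLEDGED = "acknowledged"
    RESOLVED = "resolved"
    REVIEW = "review"


# Hypothetical transition table (an assumption, not documented behavior).
# OPEN -> RESOLVED covers automatic closure before anyone acknowledges.
ALLOWED = {
    IncidentState.OPEN: {IncidentState.ACKNOWLEDGED, IncidentState.RESOLVED},
    IncidentState.ACKNOWLEDGED: {IncidentState.RESOLVED},
    IncidentState.RESOLVED: {IncidentState.REVIEW},
    IncidentState.REVIEW: set(),
}


def transition(current: IncidentState, target: IncidentState) -> IncidentState:
    """Return the new state, or raise if the move is not in the table."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target
```

For example, transition(IncidentState.OPEN, IncidentState.ACKNOWLEDGED) returns the acknowledged state, while moving a resolved incident back to open raises ValueError.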

Each incident shows the following context:
Affected service, linked monitors, and current service health.
Correlation group and related incidents where available.
Dependency context for upstream causes and downstream impact.
SLO burn alerts and telemetry context when telemetry triggered the incident.
Plain-language summaries to speed initial triage.
Incident command fields for severity, commander assignment, and communications channel.
Recent deployment and GitHub webhook context when a mapped repository changes near incident start.
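
How "near incident start" is decided is not specified here. One plausible reading is a fixed time window around the start time, sketched below. The function name deployments_near_start, the record fields repo and deployed_at, and the 30-minute default are all assumptions made for illustration.

```python
from datetime import datetime, timedelta


def deployments_near_start(incident_start, deployments, window=timedelta(minutes=30)):
    """Return deployments whose timestamp is within `window` of the incident start.

    `deployments` is a list of dicts with hypothetical keys "repo" and
    "deployed_at" (a datetime). The symmetric window is an assumption.
    """
    return [
        d for d in deployments
        if abs(d["deployed_at"] - incident_start) <= window
    ]


start = datetime(2024, 1, 1, 12, 0)
recent = deployments_near_start(start, [
    {"repo": "api", "deployed_at": datetime(2024, 1, 1, 11, 45)},
    {"repo": "web", "deployed_at": datetime(2024, 1, 1, 9, 0)},
])
# Only the "api" deployment falls inside the 30-minute window.
```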

Acknowledge an incident to stop secondary escalation logic from treating it as unattended.
Add notes for investigation evidence and decisions.
Resolve incidents only after the customer-facing impact has ended.
Update incident command so responders can see the current commander, severity, and bridge or channel.
Create post-incident action items for prevention, detection, response, or communication gaps.
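
To illustrate why acknowledging matters for escalation, here is a hedged sketch of an "unattended" check. The Incident dataclass, the needs_secondary_escalation function, and the 15-minute threshold are hypothetical. The actual escalation rules are not described in this page.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Incident:
    opened_at: datetime
    acknowledged_at: Optional[datetime] = None


def needs_secondary_escalation(incident, now, unattended_after=timedelta(minutes=15)):
    """True only when nobody has acknowledged and the threshold has passed."""
    if incident.acknowledged_at is not None:
        # An acknowledged incident is never treated as unattended.
        return False
    return now - incident.opened_at >= unattended_after
```

Under these assumptions, an incident opened at 12:00 and still unacknowledged at 12:20 would escalate, while the same incident acknowledged at 12:05 would not.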

Post-incident reviews can generate an AI-assisted draft from timeline, dependency, SLO, deployment, and correlation evidence.
Drafts include summary, impact, probable root cause, what went well, what could improve, and suggested action items.
If AI is not configured, AImonitoring falls back to deterministic, evidence-based review text so the review workflow remains usable.
Responders control publication; AI drafts are never published automatically.
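
The fallback behavior can be pictured as follows. This is a sketch under stated assumptions, not the product's code: build_review_draft, deterministic_draft, the ai_generate callable, and the evidence dictionary layout are hypothetical. The evidence keys mirror the evidence types listed above.

```python
SECTIONS = ("timeline", "dependency", "slo", "deployment", "correlation")


def deterministic_draft(evidence):
    """Build review text from evidence alone, in a fixed section order."""
    lines = []
    for key in SECTIONS:
        items = evidence.get(key, [])
        if items:
            lines.append(f"{key}: " + "; ".join(items))
    return "\n".join(lines)


def build_review_draft(evidence, ai_generate=None):
    """Use the AI generator when one is configured, else the deterministic text.

    `ai_generate` is a hypothetical callable taking the evidence dict and
    returning draft text. The draft is never marked as published here.
    """
    if ai_generate is not None:
        text, source = ai_generate(evidence), "ai"
    else:
        text, source = deterministic_draft(evidence), "deterministic"
    return {"text": text, "source": source, "published": False}
```

Either way the result is an unpublished draft, which keeps the publish decision with the responder.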

Acknowledgements, notes, manual resolution, review creation, review publishing, and action-item updates are recorded in the audit log.
Incident event timelines are kept with the incident record.
Alert delivery logs show notification attempts and outcomes.
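
As a rough picture of what an audit record for these actions might hold, consider the sketch below. The action names, the audit_entry function, and the record fields are assumptions for illustration. The real audit log schema is not documented in this page.

```python
from datetime import datetime, timezone

# Hypothetical identifiers for the audited actions listed above.
AUDITED_ACTIONS = frozenset({
    "acknowledge",
    "add_note",
    "manual_resolve",
    "create_review",
    "publish_review",
    "update_action_item",
})


def audit_entry(actor, action, incident_id, at=None):
    """Return one audit record as a dict; reject actions outside the list."""
    if action not in AUDITED_ACTIONS:
        raise ValueError(f"not an audited action: {action}")
    timestamp = at if at is not None else datetime.now(timezone.utc)
    return {
        "actor": actor,
        "action": action,
        "incident_id": incident_id,
        "at": timestamp.isoformat(),
    }
```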