Observability analytics

Observability analytics turns raw monitor, telemetry, deployment, SLO, and incident signals into prioritized operational risk views.

Audience: SREs, engineering leaders, incident commanders, service owners

What analytics shows

Average service health score across modeled services.
Unhealthy services, flagged based on incidents, SEV1 impact, SLO burn, telemetry error rate, and p95 latency.
Breached SLO burn alerts and maximum burn rate per service.
Open incident counts, SEV1 counts, resolved incident counts, and mean time to resolve.
Deployment correlations, surfaced when GitHub events occur close to the start of an incident.
Root-cause groups derived from correlated incidents and dependency context.
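
Two of the figures above follow widely used definitions, and a small sketch may help when sanity-checking what the view reports. Burn rate is conventionally the observed error rate divided by the error budget (1 minus the SLO target), and mean time to resolve is the average of resolution time minus start time over resolved incidents. The function names and input shapes below are illustrative assumptions, not AImonitoring's API, and the product's own calculation may differ in windowing or weighting.

```python
from datetime import datetime, timedelta


def burn_rate(error_rate: float, slo_target: float) -> float:
    """Observed error rate divided by the error budget implied by the SLO.

    1.0 means the budget is being spent exactly as fast as the SLO allows;
    higher values mean the budget runs out before the SLO period ends.
    (Hypothetical helper, not an AImonitoring function.)
    """
    error_budget = 1.0 - slo_target
    if error_budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_rate / error_budget


def mean_time_to_resolve(incidents: list[tuple[datetime, datetime]]) -> timedelta:
    """Average of (resolved_at - started_at) over resolved incidents.

    Each incident is a (started_at, resolved_at) pair; this shape is an
    assumption made for the example.
    """
    if not incidents:
        raise ValueError("no resolved incidents")
    total = sum((resolved - started for started, resolved in incidents), timedelta())
    return total / len(incidents)
```

For example, a service with a 99.9% SLO target and a 0.5% observed error rate has a burn rate of roughly 5, meaning it is consuming its error budget about five times faster than the SLO allows.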

How to use it

Review the lowest-scoring services first during operational planning.
Use SLO burn signals to decide whether reliability work should interrupt feature work.
Use deployment correlations to check whether a recent change may have triggered an incident.
Use root-cause groups to understand whether multiple incidents are symptoms of one upstream issue.
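
The deployment check described above can be approximated by hand as a time-window match: look for deployments to the affected service that landed shortly before the incident started. The sketch below assumes a 30-minute window, a `Deployment` record with `service`, `sha`, and `deployed_at` fields, and a helper named `deployments_near`. All of these are hypothetical and chosen for illustration; the window and matching rules AImonitoring actually applies are not specified here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Deployment:
    # Hypothetical record shape; real GitHub event payloads carry more fields.
    service: str
    sha: str
    deployed_at: datetime


def deployments_near(
    incident_service: str,
    incident_started_at: datetime,
    deployments: list[Deployment],
    window: timedelta = timedelta(minutes=30),  # assumed window, not a product default
) -> list[Deployment]:
    """Return deployments to the same service within `window` before the
    incident start, most recent first."""
    earliest = incident_started_at - window
    candidates = [
        d
        for d in deployments
        if d.service == incident_service
        and earliest <= d.deployed_at <= incident_started_at
    ]
    return sorted(candidates, key=lambda d: d.deployed_at, reverse=True)
```

A match only means the change is worth inspecting first. Proximity in time does not establish that the deployment caused the incident.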

Data required

Services should be modeled with owner teams and tiers.
Telemetry improves error-rate and latency scoring.
SLO alert states improve burn-rate prioritization.
GitHub repository mappings and verified webhooks improve deployment correlation.
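
As a rough way to audit these inputs before relying on the analytics view, the sketch below models a service record and lists which of the inputs above are still missing. The `ServiceModel` fields and the `missing_inputs` helper are assumptions made for this example and do not reflect AImonitoring's actual schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ServiceModel:
    # Hypothetical fields mirroring the inputs listed above.
    name: str
    owner_team: Optional[str] = None
    tier: Optional[int] = None
    has_telemetry: bool = False
    has_slo_alerts: bool = False
    github_repo: Optional[str] = None
    webhook_verified: bool = False


def missing_inputs(svc: ServiceModel) -> list[str]:
    """List the inputs that are not yet configured for a service."""
    gaps = []
    if not svc.owner_team:
        gaps.append("owner team")
    if svc.tier is None:
        gaps.append("tier")
    if not svc.has_telemetry:
        gaps.append("telemetry")
    if not svc.has_slo_alerts:
        gaps.append("SLO alert states")
    if not svc.github_repo:
        gaps.append("GitHub repository mapping")
    elif not svc.webhook_verified:
        # A webhook can only be verified once a repository is mapped.
        gaps.append("verified webhook")
    return gaps
```

Running `missing_inputs` across all modeled services gives a simple checklist of where scoring, burn-rate prioritization, or deployment correlation is likely to be incomplete.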

Related documentation

Services and SLOs

Model owned services, link monitors, define dependencies, and track service-level objectives.

Telemetry

Ingest OTLP JSON logs, metrics, and traces, and inspect trace detail inside AImonitoring.

Integrations

Connect AImonitoring to incident response, workflow, deployment, telemetry, and automation providers.

Incidents

Acknowledge, investigate, route, resolve, and review service incidents.