
Observability analytics

Observability analytics turns raw monitor, telemetry, deployment, SLO, and incident signals into prioritized operational risk views.

Audience: SREs, engineering leaders, incident commanders, service owners

What analytics shows

  • Average service health score across modeled services.
  • Unhealthy services based on incidents, SEV1 impact, SLO burn, telemetry error rate, and p95 latency.
  • Breached SLO burn alerts and maximum burn rate per service.
  • Open incident counts, SEV1 counts, resolved incident counts, and mean time to resolve.
  • Deployment correlations, flagged when GitHub deployment events occur close to an incident's start time.
  • Root-cause groups from correlated incidents and dependency context.
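
To illustrate the deployment-correlation idea in the list above, here is a minimal sketch. It assumes that "near incident start" means a deployment to the same service landed within a fixed window before the incident began. The `Deployment` and `Incident` types, the `correlate` function, and the 30-minute default window are hypothetical names and values chosen for illustration, not the product's actual data model or matching rule.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Deployment:
    service: str
    sha: str
    deployed_at: datetime


@dataclass
class Incident:
    service: str
    incident_id: str
    started_at: datetime


def correlate(incidents, deployments, window=timedelta(minutes=30)):
    """Pair each incident with same-service deployments that landed
    at most `window` before the incident started.

    The window length and the "before, not after" rule are assumptions.
    """
    pairs = []
    for inc in incidents:
        for dep in deployments:
            if dep.service != inc.service:
                continue
            lead = inc.started_at - dep.deployed_at
            if timedelta(0) <= lead <= window:
                pairs.append((inc.incident_id, dep.sha))
    return pairs
```

A match from a check like this is only a hint that a change may be related; it does not establish that the deployment caused the incident.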

How to use it

  • Review the lowest-scoring services first during operational planning.
  • Use burn signals to decide whether reliability work should interrupt feature work.
  • Use deployment correlations to check whether a recent change may have triggered an incident.
  • Use root-cause groups to understand whether multiple incidents are symptoms of one upstream issue.
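
The first two practices above can be sketched in a few lines. This example uses the common definition of burn rate (observed error rate divided by the error budget, where the budget is one minus the SLO target); whether the product computes burn rate the same way is an assumption. The function names, the dictionary keys, and the default threshold of 1.0 are hypothetical.

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """Observed error rate divided by the error budget (1 - SLO target)."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_rate / budget


def review_order(services):
    """Sort services so the lowest health score comes first."""
    return sorted(services, key=lambda s: s["health_score"])


def needs_reliability_work(max_burn_rate: float, threshold: float = 1.0) -> bool:
    """A sustained burn rate above 1.0 exhausts the budget before the SLO
    window ends; the threshold a team actually acts on is its own choice."""
    return max_burn_rate > threshold
```

For example, a service with a 99.9% target and a 0.5% error rate is burning its budget at roughly five times the sustainable pace, which is a strong signal to pause feature work.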

Data required

  • Services should be modeled with owner teams and tiers.
  • Telemetry improves error-rate and latency scoring.
  • SLO alert states improve burn-rate prioritization.
  • GitHub repository mappings and verified webhooks improve deployment correlation.
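
As a rough way to audit the inputs listed above, the sketch below reports which ones a service is missing. The `ServiceRecord` fields and the `missing_inputs` helper are hypothetical and only mirror the bullet points; they are not the product's schema.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ServiceRecord:
    name: str
    owner_team: Optional[str] = None
    tier: Optional[int] = None
    has_telemetry: bool = False
    has_slo_alerts: bool = False
    github_repo: Optional[str] = None
    webhook_verified: bool = False


def missing_inputs(svc: ServiceRecord) -> List[str]:
    """List the data inputs this service lacks, in the order given above."""
    gaps = []
    if not svc.owner_team:
        gaps.append("owner team")
    if svc.tier is None:
        gaps.append("tier")
    if not svc.has_telemetry:
        gaps.append("telemetry")
    if not svc.has_slo_alerts:
        gaps.append("SLO alert states")
    if not svc.github_repo:
        gaps.append("GitHub repository mapping")
    elif not svc.webhook_verified:
        gaps.append("verified webhook")
    return gaps
```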
