Reliability

Services and SLOs

Services connect monitors, owners, dependencies, incidents, and SLOs so reliability is organized around business systems rather than isolated checks.

Audience: SREs, service owners, platform teams

Service catalog

  • Create a service for each product surface, API, workflow, or internal platform dependency.
  • Assign criticality tiers such as customer critical, important, standard, or internal.
  • Assign an owner team so responders know who is responsible.
  • Link monitors that represent the service's availability and correctness.
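
As a rough illustration of the catalog fields above, a service entry can be modeled as a small record and checked for gaps. This is only a sketch: the `Service` class, the `Tier` enum, the `catalog_gaps` helper, and the sample service names are hypothetical and are not part of any product API.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Tier(Enum):
    # Tier names come from the list above; the numeric ranks are an assumption.
    CUSTOMER_CRITICAL = 1
    IMPORTANT = 2
    STANDARD = 3
    INTERNAL = 4


@dataclass
class Service:
    name: str
    tier: Tier
    owner_team: str
    monitor_ids: List[str] = field(default_factory=list)

    def catalog_gaps(self) -> List[str]:
        """Return human-readable problems with this catalog entry."""
        gaps = []
        if not self.owner_team.strip():
            gaps.append("no owner team")
        if not self.monitor_ids:
            gaps.append("no linked monitors")
        return gaps


# Hypothetical entries: one complete, one missing an owner and monitors.
checkout = Service("checkout-api", Tier.CUSTOMER_CRITICAL, "payments", ["mon-1"])
orphan = Service("legacy-export", Tier.INTERNAL, "")
```

A check like `catalog_gaps` could run over the whole catalog to find services that responders would not know how to route.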

Dependencies and blast radius

  • Add an upstream dependency for each service that this service needs in order to stay healthy.
  • Set the dependency type to sync API, async queue, database, third-party, or internal service.
  • Use criticality to separate hard blockers from lower-risk dependencies.
  • Review downstream impact to see which services may be affected when a dependency fails.
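
One way to reason about downstream impact is as a traversal of the dependency graph in reverse: start at the failing dependency and collect every service that needs it, directly or transitively. The sketch below assumes a simple edge list. The service names, the `DEPENDENCIES` data, and the `downstream_impact` function are hypothetical, and the product may compute impact differently.

```python
from collections import defaultdict, deque

# Hypothetical edges: (service, upstream dependency it needs, is_hard_blocker).
DEPENDENCIES = [
    ("checkout", "payments-api", True),
    ("checkout", "recommendations", False),
    ("payments-api", "payments-db", True),
    ("order-history", "payments-db", True),
]


def downstream_impact(failing, dependencies, hard_only=False):
    """Return every service that directly or transitively depends on `failing`."""
    # Reverse the edges: map each upstream to the services that need it.
    dependents = defaultdict(set)
    for service, upstream, hard in dependencies:
        if hard or not hard_only:
            dependents[upstream].add(service)

    # Breadth-first walk from the failing dependency.
    impacted, queue = set(), deque([failing])
    while queue:
        current = queue.popleft()
        for service in dependents[current]:
            if service != failing and service not in impacted:
                impacted.add(service)
                queue.append(service)
    return impacted
```

Passing `hard_only=True` limits the result to services blocked through hard dependencies, which mirrors the use of criticality to separate hard blockers from lower-risk dependencies.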

SLO configuration

  • Name the SLO after the user-facing reliability promise, such as Checkout availability.
  • Set a target percentage that reflects the business promise.
  • Set latency thresholds when slow responses should count against reliability.
  • Set a window in days for the rolling measurement period.
  • Review the error budget consumed and remaining to gauge how close the service is to missing its target.
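
The arithmetic behind budget consumed and budget remaining can be sketched as follows. The example assumes an event-based SLO, where each request in the window is either good or bad and a slow response counts as bad when a latency threshold is set. The `Slo` class, `is_good`, `budget_report`, and the sample numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Slo:
    name: str
    target_percent: float                          # e.g. 99.0
    window_days: int                               # rolling window length
    latency_threshold_ms: Optional[float] = None   # None: latency is not counted


def is_good(slo, ok, latency_ms):
    """A request is good if it succeeded and, when a threshold is set, was fast enough."""
    if not ok:
        return False
    return slo.latency_threshold_ms is None or latency_ms <= slo.latency_threshold_ms


def budget_report(slo, events):
    """events: (ok, latency_ms) tuples already limited to the SLO window."""
    events = list(events)
    total = len(events)
    bad = sum(1 for ok, latency in events if not is_good(slo, ok, latency))
    allowed_bad = total * (1 - slo.target_percent / 100)
    if allowed_bad > 0:
        consumed = bad / allowed_bad
    else:
        # A 100% target (or an empty window) leaves no budget to spend.
        consumed = float("inf") if bad else 0.0
    return {"total": total, "bad": bad, "consumed": consumed, "remaining": 1 - consumed}


# Hypothetical window: 996 fast successes, 2 failures, 2 slow successes.
slo = Slo("Checkout availability", target_percent=99.0, window_days=30,
          latency_threshold_ms=500)
events = [(True, 120)] * 996 + [(False, 80)] * 2 + [(True, 900)] * 2
report = budget_report(slo, events)
```

With a 99% target over 1,000 requests, about 10 bad requests are allowed, so the 4 bad requests in this sample consume roughly 40% of the budget and leave roughly 60%.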

Burn-rate alerts

  • Telemetry-backed burn alerts detect fast error burn, slow error burn, and latency burn.
  • Burn alerts can open service incidents when reliability is degrading.
  • Burn alerts are most reliable when services receive consistent telemetry and linked monitor data.
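
Burn rate is commonly defined as the observed bad-event rate divided by the rate the SLO allows, so a burn rate of 1 spends the budget exactly over the window. The sketch below uses that definition. The thresholds (14.4 over one hour, 6 over six hours) are the values often cited for a 30-day window in SRE practice, not values confirmed for this product, and the function names are hypothetical.

```python
# Assumed thresholds for a 30-day window; the product's actual values are not stated.
FAST_BURN_THRESHOLD = 14.4   # sustained for 1 hour, spends about 2% of the budget
SLOW_BURN_THRESHOLD = 6.0    # sustained for 6 hours, spends about 5% of the budget


def burn_rate(bad, total, target_percent):
    """Observed bad-event rate divided by the rate the SLO allows."""
    if total == 0:
        return 0.0
    allowed = 1 - target_percent / 100
    if allowed <= 0:
        # A 100% target allows no bad events at all.
        return float("inf") if bad else 0.0
    return (bad / total) / allowed


def classify_burn(rate_1h, rate_6h):
    """Return 'fast', 'slow', or None for the given window burn rates."""
    if rate_1h >= FAST_BURN_THRESHOLD:
        return "fast"
    if rate_6h >= SLOW_BURN_THRESHOLD:
        return "slow"
    return None


# Hypothetical last hour: 20 bad requests out of 1,000 against a 99.9% target.
last_hour = burn_rate(bad=20, total=1000, target_percent=99.9)
```

Here the last hour burns budget at about 20 times the sustainable rate, which `classify_burn` would label as a fast burn. A latency burn can be computed the same way by counting responses slower than the SLO's latency threshold as bad events.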
