Reliability

Services and SLOs

Services connect monitors, owners, dependencies, incidents, and SLOs so reliability is organized around business systems rather than isolated checks.

Audience: SREs, service owners, platform teams

Service catalog

Create a service for each product surface, API, workflow, or internal platform dependency.
Assign each service a criticality tier, such as customer critical, important, standard, or internal.
Assign an owner team so responders know who is responsible.
Link monitors that represent the service's availability and correctness.
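As an illustration only, the catalog fields above can be modeled as a small record with a completeness check. The field names, the tier list, and the catalog_gaps helper are hypothetical and are not the AImonitoring data model or API.

```python
from dataclasses import dataclass, field

# Hypothetical tier names, copied from the examples in the text above.
TIERS = ("customer critical", "important", "standard", "internal")


@dataclass
class Service:
    """Illustrative catalog entry; not the product's actual schema."""
    name: str
    tier: str
    owner_team: str = ""
    monitors: list[str] = field(default_factory=list)


def catalog_gaps(service: Service) -> list[str]:
    """Return the catalog steps that are still missing for a service."""
    gaps = []
    if service.tier not in TIERS:
        gaps.append(f"unknown tier: {service.tier!r}")
    if not service.owner_team:
        gaps.append("no owner team")
    if not service.monitors:
        gaps.append("no linked monitors")
    return gaps


# Example entries (names are made up for the sketch).
checkout = Service("checkout-api", "customer critical", "payments", ["checkout-http"])
orphan = Service("report-export", "standard")
```

A check like this is one way to find services that responders could not route or assess during an incident, because they lack an owner or any linked monitor.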

Dependencies and blast radius

Add an upstream dependency for each service that this service needs in order to stay healthy.
Use the dependency type to record whether the dependency is a sync API, async queue, database, third-party service, or internal service.
Use criticality to separate hard blockers from lower-risk dependencies.
Downstream impact helps responders understand which services may be affected by a failing dependency.
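One way to reason about downstream impact is as a graph traversal: invert the upstream dependency edges, then walk outward from the failing service. The sketch below assumes a hypothetical in-memory map of upstream dependencies; the service names and the downstream_impact function are invented for illustration and do not describe how AImonitoring computes impact internally.

```python
from collections import deque

# Hypothetical upstream map: each service lists the services it depends on.
UPSTREAM = {
    "checkout": ["payments-api", "cart-db"],
    "payments-api": ["card-processor"],
    "order-history": ["cart-db"],
    "card-processor": [],
    "cart-db": [],
}


def downstream_impact(failing, upstream):
    """Return every service that depends, directly or transitively, on `failing`."""
    # Invert the edges so we can look up "who depends on X".
    dependents = {}
    for service, deps in upstream.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(service)

    # Breadth-first walk over the inverted edges; `impacted` also guards against cycles.
    impacted = set()
    queue = deque([failing])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, ()):
            if service != failing and service not in impacted:
                impacted.add(service)
                queue.append(service)
    return impacted
```

In this example, a failure in card-processor reaches payments-api directly and checkout transitively, while order-history is unaffected. A fuller version could skip edges whose criticality marks them as lower-risk rather than hard blockers.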

SLO configuration

Name the SLO after the user-facing reliability promise, such as Checkout availability.
Set a target percentage that reflects the business promise.
Set latency thresholds when slow responses should count against reliability.
Set a window in days for the rolling measurement period.
Review budget consumed and budget remaining to understand reliability risk.
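The relationship between target, window, and budget can be made concrete with the standard error-budget arithmetic. The sketch below assumes a request-based SLO in which a "bad" event is a failed request or, when a latency threshold is set, a response slower than that threshold. The function names are hypothetical, and the product may calculate budget differently.

```python
def allowed_downtime_minutes(target_pct, window_days):
    """Minutes of full outage a target permits over a rolling window."""
    return (1 - target_pct / 100) * window_days * 24 * 60


def error_budget(target_pct, total_events, bad_events):
    """Return allowed bad events plus the fraction of budget consumed and remaining."""
    if not 0 < target_pct < 100:
        raise ValueError("target_pct must be strictly between 0 and 100")
    allowed_bad = (1 - target_pct / 100) * total_events
    if allowed_bad == 0:
        # No traffic in the window means no budget has been spent yet.
        return {"allowed_bad": 0.0, "consumed": 0.0, "remaining": 1.0}
    consumed = bad_events / allowed_bad
    return {
        "allowed_bad": allowed_bad,
        "consumed": consumed,
        "remaining": 1 - consumed,  # negative once the budget is overspent
    }


# A 99.9% target over 30 days allows roughly 43.2 minutes of full outage.
downtime = allowed_downtime_minutes(99.9, 30)

# 400 bad requests out of 1,000,000 against a 99.9% target uses about 40% of the budget.
budget = error_budget(99.9, 1_000_000, 400)
```

Raising the target from 99.9% to 99.99% cuts the budget by a factor of ten, which is why the target should reflect the actual business promise rather than an aspirational number.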

Burn-rate alerts

Telemetry-backed burn alerts detect fast error burn, slow error burn, and latency burn.
Burn alerts can open service incidents when reliability is degrading.
Burn alerts are most reliable when services receive consistent telemetry and have linked monitor data.
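Burn rate is commonly defined as the observed bad-event rate divided by the rate the SLO allows, so a burn rate of 1 spends the budget exactly over the SLO window. The sketch below uses that definition. The thresholds 14.4 and 6 are the example values popularized by the Google SRE Workbook for a 30-day window and are assumptions here, not AImonitoring defaults; the function names are hypothetical as well.

```python
def burn_rate(bad_events, total_events, target_pct):
    """Observed bad-event rate divided by the rate the SLO target allows."""
    if total_events == 0:
        return 0.0  # no traffic in the window, so nothing is burning
    allowed_rate = 1 - target_pct / 100
    return (bad_events / total_events) / allowed_rate


def classify_burn(rate, fast_threshold=14.4, slow_threshold=6.0):
    """Label a burn rate; thresholds are illustrative, not product defaults."""
    if rate >= fast_threshold:
        return "fast"
    if rate >= slow_threshold:
        return "slow"
    return "ok"


# 2% bad events in a short window against a 99.9% target burns about 20x too fast.
short_window = burn_rate(2_000, 100_000, 99.9)

# 0.7% bad events over a longer window burns about 7x too fast.
long_window = burn_rate(700, 100_000, 99.9)
```

The same calculation applies to latency burn if bad_events counts responses slower than the SLO's latency threshold instead of failed requests. In practice a fast threshold is usually evaluated over a short window and a slow threshold over a longer one, so that brief spikes and gradual degradation are both caught.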

Related documentation

Monitors

Create HTTP, TCP, ping, heartbeat, and AI-agent synthetic monitors with thresholds and regions.

Telemetry

Ingest OTLP JSON logs, metrics, and traces, and inspect trace detail inside AImonitoring.

Incidents

Acknowledge, investigate, route, resolve, and review service incidents.

Team and access management

Invite users, assign organization roles, and manage service team membership.