
SLA, SLO, and SLI Metrics

Understanding SLA, SLO, and SLI

Service Level Agreement (SLA)

  • Definition: A contract between a service provider and a customer specifying the expected level of service.

  • Purpose: Defines obligations and consequences if the agreed-upon reliability is not met.

  • Example: "The service will be available 99.9% of the time per month. If this is not met, a refund of 10% will be issued."

Service Level Objective (SLO)

  • Definition: A specific, measurable target for the level of service reliability.

  • Purpose: Serves as a benchmark to ensure the SLA is met.

  • Example: "99.95% of HTTP requests will return a response within 200ms."

  • Relation to SLA: Typically stricter than the SLA, giving room to address failures before breaching the SLA.

Service Level Indicator (SLI)

  • Definition: A metric that quantifies system performance to track compliance with SLOs.

  • Purpose: Acts as the raw data used to measure whether an SLO is met.

  • Example: "The percentage of successful HTTP requests over the past 30 days."

Example Relationships Between SLA, SLO, and SLI

  • SLI: Measured uptime = 99.93%

  • SLO: Target uptime = 99.95%

  • SLA: Guaranteed uptime = 99.90%
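In this example the measured uptime (99.93%) falls short of the internal SLO (99.95%) but still satisfies the SLA (99.90%), so the team has a warning before any contractual penalty applies. A minimal sketch of that comparison follows; the function name `classify` and the status strings are illustrative assumptions, not part of any standard tooling.

```python
def classify(measured: float, slo: float, sla: float) -> str:
    """Compare a measured SLI (in percent) with the SLO target and the SLA guarantee.

    Assumes the SLO is at least as strict as the SLA (slo >= sla).
    """
    if measured >= slo:
        return "meets SLO"
    if measured >= sla:
        return "SLO missed, SLA still met"
    return "SLA breached"


# Values from the example above.
print(classify(measured=99.93, slo=99.95, sla=99.90))  # SLO missed, SLA still met
```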

Calculating Availability and Reliability

Availability is often expressed as a percentage of uptime over a given period:

$$\text{Availability (\%)} = \left( \frac{\text{Uptime}}{\text{Total Time}} \right) \times 100$$

Example:

  • Total Time: 30 days (43,200 minutes)

  • Downtime: 30 minutes

$$\text{Availability} = \left( \frac{43{,}200 - 30}{43{,}200} \right) \times 100 \approx 99.93\%$$
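The availability calculation can be reproduced in a few lines. This is a sketch under simple assumptions: time is measured in minutes, and the helper names `availability_percent` and `allowed_downtime_minutes` are made up for illustration. The second helper inverts the same formula to show how much downtime a given target leaves room for.

```python
def availability_percent(total_minutes: float, downtime_minutes: float) -> float:
    """Availability (%) = uptime / total time * 100, with uptime = total - downtime."""
    if total_minutes <= 0:
        raise ValueError("total_minutes must be positive")
    if not 0 <= downtime_minutes <= total_minutes:
        raise ValueError("downtime_minutes must be between 0 and total_minutes")
    return (total_minutes - downtime_minutes) / total_minutes * 100


def allowed_downtime_minutes(total_minutes: float, target_percent: float) -> float:
    """Downtime that still satisfies target_percent (the formula solved for downtime)."""
    return total_minutes * (1 - target_percent / 100)


total = 30 * 24 * 60  # 30 days = 43,200 minutes
print(f"{availability_percent(total, 30):.2f}%")  # 99.93%
print(f"{allowed_downtime_minutes(total, 99.9):.1f} minutes")  # 43.2 minutes
```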

Reliability measures the likelihood that a system runs without failure over a specific period. Assuming a constant failure rate (an exponential failure model), the probability of running for one unit of time without failure is:

$$\text{Reliability (\%)} = e^{-\left(\frac{\text{Total Failures}}{\text{Total Time}}\right)}$$

Example:

  • Failures: 2

  • Time: 100 hours

$$\text{Reliability} = e^{-\left(\frac{2}{100}\right)} = e^{-0.02} \approx 98.02\%$$
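The same figure can be checked numerically. The sketch below assumes a constant failure rate, so reliability over a period is `exp(-rate * period)`. The `mission_hours` parameter is an addition for illustration and defaults to one hour, which matches the example above.

```python
import math


def reliability(failures: int, total_hours: float, mission_hours: float = 1.0) -> float:
    """Probability of running mission_hours without failure.

    Assumes a constant failure rate estimated as failures / total_hours.
    Returns a fraction between 0 and 1.
    """
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    failure_rate = failures / total_hours  # failures per hour
    return math.exp(-failure_rate * mission_hours)


# 2 failures observed over 100 hours, evaluated for a one-hour period.
print(f"{reliability(2, 100) * 100:.2f}%")  # 98.02%
```

Under the same assumption, a longer period gives a lower value, for example `reliability(2, 100, mission_hours=24)`.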

Examples of Defining Metrics for a Service

Let’s define SLAs, SLOs, and SLIs for a simple web API

1. SLI Examples:

  • Latency: Percentage of requests that complete within 200ms.

  • Uptime: Percentage of time the service is reachable.

  • Error Rate: Percentage of failed requests.

2. SLO Examples:

  • Uptime SLO: "The service uptime will be at least 99.95% per month."

  • Latency SLO: "95% of requests will complete within 200ms, measured over a rolling 7-day window."

  • Error Rate SLO: "The error rate will not exceed 0.1% of total requests, measured over a rolling 30-day window."
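To make the link between SLIs and SLOs concrete, the sketch below computes a latency SLI and an error-rate SLI from a list of request records and compares them with the targets quoted above. The `Request` record, the function names, and the sample data are all hypothetical. Counting every status of 400 or above as a failure is an assumption; many teams count only 5xx responses.

```python
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float
    status: int  # HTTP status code


def latency_sli(requests: list[Request], threshold_ms: float = 200.0) -> float:
    """Percentage of requests that completed within threshold_ms."""
    if not requests:
        raise ValueError("no requests in the measurement window")
    fast = sum(1 for r in requests if r.latency_ms <= threshold_ms)
    return fast / len(requests) * 100


def error_rate_sli(requests: list[Request]) -> float:
    """Percentage of requests that failed (assumption: status >= 400 is a failure)."""
    if not requests:
        raise ValueError("no requests in the measurement window")
    failed = sum(1 for r in requests if r.status >= 400)
    return failed / len(requests) * 100


# Made-up sample window: 1,900 fast successes, 99 slow successes, 1 fast failure.
sample = (
    [Request(120.0, 200)] * 1900
    + [Request(350.0, 200)] * 99
    + [Request(90.0, 500)]
)

latency_ok = latency_sli(sample) >= 95.0   # Latency SLO: at least 95% within 200ms
errors_ok = error_rate_sli(sample) <= 0.1  # Error Rate SLO: at most 0.1% failures
print(f"latency SLI {latency_sli(sample):.2f}% -> SLO met: {latency_ok}")
print(f"error rate SLI {error_rate_sli(sample):.2f}% -> SLO met: {errors_ok}")
```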

3. SLA Example:

  • SLA: "The service will maintain 99.9% uptime per month. For every 0.1% below this threshold, a 5% refund of the monthly fee will be issued."
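The refund clause can be expressed as a small calculation. The sketch below rests on two assumptions that the SLA text does not spell out: only whole 0.1-point steps below the threshold count, and the refund is capped at 100% of the monthly fee. The function name `refund_percent` is illustrative. Percentages are passed as strings and handled with `Decimal` to avoid floating-point rounding at the step boundaries.

```python
from decimal import Decimal


def refund_percent(
    measured_uptime: str,
    guaranteed_uptime: str = "99.9",
    step: str = "0.1",
    refund_per_step: int = 5,
) -> int:
    """Refund (as a percentage of the monthly fee) owed for a given measured uptime.

    Assumptions: only complete steps count, and the refund is capped at 100%.
    """
    shortfall = Decimal(guaranteed_uptime) - Decimal(measured_uptime)
    if shortfall <= 0:
        return 0  # SLA met, no refund
    full_steps = int(shortfall // Decimal(step))
    return min(full_steps * refund_per_step, 100)


print(refund_percent("99.7"))  # 10
```

Here a measured uptime of 99.7% is two full steps below 99.9%, so the refund is 2 × 5% = 10%.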

Monitoring and Implementation

Monitoring SLIs:

  • Latency: Use tools like Prometheus to track response time.

  • Uptime: Use uptime monitoring tools like Pingdom or a custom Prometheus exporter.

  • Error Rate: Count HTTP 4xx and 5xx responses from request metrics and divide by the total number of requests.

Visualizing in Grafana:

  • Create panels for each metric to display SLI performance over time.

  • Set alerts when an SLO is violated (e.g., error rate exceeds 0.1%).
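The logic behind a threshold alert such as "error rate exceeds 0.1%" can be sketched independently of any monitoring tool. The class below is a hypothetical illustration, not the Prometheus or Grafana API: it keeps the most recent per-interval request counts and reports whether the error rate across that window is above the threshold.

```python
from collections import deque


class ErrorRateAlert:
    """Fires when the error rate over the last `window` samples exceeds the threshold."""

    def __init__(self, threshold_percent: float = 0.1, window: int = 5) -> None:
        self.threshold_percent = threshold_percent
        # Each sample is (total_requests, failed_requests) for one interval;
        # the deque drops the oldest sample once the window is full.
        self.samples: deque[tuple[int, int]] = deque(maxlen=window)

    def observe(self, total: int, failed: int) -> bool:
        """Record one interval and return True if the alert should fire."""
        self.samples.append((total, failed))
        total_sum = sum(t for t, _ in self.samples)
        failed_sum = sum(f for _, f in self.samples)
        if total_sum == 0:
            return False  # no traffic, nothing to evaluate
        return failed_sum / total_sum * 100 > self.threshold_percent


alert = ErrorRateAlert(threshold_percent=0.1, window=3)
for total, failed in [(1000, 0), (1000, 1), (1000, 5)]:
    firing = alert.observe(total, failed)
    print(f"total={total} failed={failed} firing={firing}")
```

Evaluating over a window rather than a single interval is a design choice that reduces alerts triggered by a single brief spike.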

