Site Reliability Engineering

Google Cloud - SRE fundamentals: SLIs, SLAs and SLOs

SRE begins with the idea that a prerequisite to success is availability.

What Is a Site Reliability Engineer?

  • A site reliability engineer is responsible for the monitoring, automation, and reliability of IT operations. They use software development tools to automate IT operations tasks like change management, incident response, and production system management. They’re also responsible for monitoring the health of software deployments and relaying logs and data back to the developers.

Service Level Indicators

SLI 는 시스템의 가용성을 파악하기 위한 핵심 지표 를 의미한다. 지표로는 error rate (http 요청에 대한 http status 5xx 의 비율) 와 latency 등이 있다.

Service Level Objectives

SLO 는 시스템에서 기대되는 가용성을 설정한 목표 이고, 이것을 위반하면 incidents 로 간주하고 opsgenie 등으로 incidents 알람이 온다.

References