Own and improve SLOs, SLIs, and error budgets for critical services across playback, login, subscription, recommendation, and API layers.
Build and maintain observability stacks (Prometheus, Grafana, OpenTelemetry, Datadog) to proactively detect and resolve issues.
Drive incident management, root cause analysis (RCA), and postmortem culture for service outages and performance degradation.
Automate repetitive operational tasks via IaC (Terraform), CI/CD (GitHub Actions), and scripting (Python/Bash/Golang).
Collaborate with backend, frontend, and data teams to design fault-tolerant, scalable infrastructure (GKE, Cloud Run, Cloud CDN, etc.).
Work closely with security and platform teams to ensure system hardening, compliance, and zero-trust principles.
Continuously assess infrastructure cost and performance trade-offs to optimize cloud spend (GCP preferred).
Contribute to the evolution of our deployment strategy (blue/green, canary, A/B), especially during high-traffic events (e.g., livestreams, premieres).