Job Description
Own and improve SLOs, SLIs, and error budgets for critical services across playback, login, subscription, recommendation, and API layers.
Build and maintain observability stacks (Prometheus, Grafana, OpenTelemetry, Datadog) to proactively detect and resolve issues.
Drive incident management, root cause analysis (RCA), and postmortem culture for service outages and performance degradation.
Automate repetitive operational tasks via IaC (Terraform), CI/CD (GitHub Actions), and scripting (Python, Bash, or Go).
Collaborate with backend, frontend, and data teams to design fault-tolerant, scalable infrastructure (GKE, Cloud Run, Cloud CDN, etc.).
Work closely with security and platform teams to ensure system hardening, compliance, and zero-trust principles.
Continuously assess infrastructure cost and performance trade-offs to optimize cloud spend (GCP preferred).
Contribute to the evolution of our deployment strategy (blue/green, canary, A/B), especially during high-traffic events (e.g. livestreams, premieres).
Job Requirements
5+ years of experience as an SRE, DevOps Engineer, or Production Engineer in large-scale environments.
Strong knowledge of Linux internals, networking, and systems performance tuning.
Deep experience with Kubernetes, containers, and service mesh technologies (Istio or Linkerd).
Proficiency with cloud platforms (preferably GCP), including IAM, Compute, GKE, Cloud CDN, Cloud Logging.
Solid experience with monitoring, logging, and alerting stacks (e.g. Prometheus, Grafana, ELK, Loki, Datadog).
Strong scripting or programming skills in Python, Go, or Bash.
Familiarity with CI/CD, IaC, and GitOps tools (Terraform, Helm, ArgoCD, Cloud Build).
Clear communication skills and a calm, analytical approach to solving complex problems in high-pressure environments.
Nice to Have
Experience supporting real-time media systems or video streaming platforms.
Knowledge of multi-region HA, failover, and edge optimization strategies (especially for Asia-Pacific markets).
Familiarity with error budgets, chaos engineering, and resiliency testing.
Background in supporting platform services for experimentation (A/B), personalization, or user engagement.
Benefits
Own the reliability of a platform used by 20M+ users with large-scale live events and high concurrency.
Work in a modern, cloud-native environment (GCP, Kubernetes, Kafka, Iceberg, Cloud CDN).
Be part of a highly autonomous engineering culture focused on velocity, quality, and learning.
Influence architecture and process for the next generation of entertainment infrastructure in Vietnam and beyond.