- Manage and improve system reliability through SLO, SLI, and SLA practices.
- Design and implement observability systems (metrics, logs, tracing, alerting) using tools such as Prometheus, Grafana, and ELK.
- Build and automate CI/CD pipelines and Infrastructure as Code (IaC) using tools such as Terraform, Ansible, Pulumi, and Helm.
- Collaborate on the analysis, design, and deployment of systems and processes to ensure reliability, observability, and scalability.
- Optimize system cost, performance (latency, throughput), and security.
- Operate and optimize Kubernetes clusters (EKS); strong knowledge of Docker, Kubernetes, and Helm is required.
- Develop internal tools to automate workflows and support other teams.
- Participate in incident response, root cause analysis, and postmortem reviews, and improve incident handling processes.
- Support and coordinate with NOC (Network Operations Center) teams.
- Join the on-call rotation when needed.