1. About the Role:

We are seeking a highly skilled Site Reliability Engineer with experience applying GenAI to automate and enhance the reliability of complex data platforms in Data Division. You will be responsible for building self-healing infrastructure, AI-powered observability, and automating incident response across data pipelines (e.g., Databricks, Glue, Kafka, Flink). This is a high-impact role where you will shape the future of data reliability at Techcombank, mentor engineers, and lead initiatives that span multiple teams and domains.

2. Key Responsibilities:

Platform Reliability & Automation
• Design, implement, and operate reliable, scalable, and observable data platforms.
• Automate incident triage, remediation, and postmortems using GenAI-powered tools.
• Develop intelligent runbooks and self-healing workflows using LLMs.

GenAI-Enabled SRE Practices
• Build and integrate GenAI copilots for on-call support, anomaly detection, and RCA (root cause analysis).
• Fine-tune or prompt engineer LLMs for specific use cases like summarizing logs, interpreting metrics, or generating remediation steps.
• Leverage vector databases (e.g., FAISS, Weaviate) to retrieve telemetry and incident history for GenAI prompts.

Observability & Anomaly Detection
• Integrate GenAI with observability tools (e.g., Datadog, Prometheus, Grafana, OpenTelemetry).
• Build systems for natural language querying of platform health and pipeline performance.
• Collaborate with data engineers to monitor SLIs/SLOs across ingestion, transformation, and delivery layers.

CI/CD & Risk Management
• Integrate GenAI into CI/CD pipelines to generate blast radius analyses and deployment guardrails.
• Use LLMs to assess the risk of configuration or schema changes before production rollout.
• Automate validation and rollback strategies based on historical outcomes.