Job Description
System Reliability & Performance
Own the reliability, scalability, and performance of core production systems.
Perform advanced performance troubleshooting and tuning across OS, network, and application layers.
Optimize resource usage on bare-metal Linux servers to maximize efficiency and reliability.
Data Infrastructure Reliability
Operate and scale our Kafka-based enterprise messaging and event streaming system.
Ensure high availability and performance of our ClickHouse data warehouse.
Automation & Observability
Enhance system observability through metrics, tracing, and logging (Prometheus, Grafana, CheckMK, OpenTelemetry).
Design and maintain alerting systems that balance coverage with actionable signals.
Incident Response & Coordination
Lead high-severity incident response, and act as the arbiter for cross-team coordination when failures affect multiple teams.
Drive blameless postmortems and systemic improvements.
Reliability Culture & Mentorship
Mentor engineers on performance tuning, deployment safety, and reliability-first design.
Promote a culture of automation, ownership, and operational excellence.
Job Requirements
Experience
5+ years in SRE, systems engineering, or infrastructure-focused roles, including at least 2 years in a senior or lead position.
Strong track record managing large-scale production systems on bare-metal Linux.
Technical Skills
Expert-level skills in Linux internals, system performance troubleshooting, and tuning.
Hands-on experience operating and scaling Kafka or equivalent messaging systems.
Hands-on experience operating and scaling ClickHouse or a similar OLAP database.
Solid coding/scripting ability in Python, Go, or Bash.
Proficiency with Infrastructure-as-Code tools (Terraform, Ansible, etc.).
Experience building and operating highly available distributed systems.
Soft Skills
Analytical problem-solver with a strong performance-first mindset.
Advocates for automation and reducing toil.
Communicates clearly across both technical and non-technical teams.
Thrives in high-accountability, reliability-driven environments.
Nice-to-Have
Hands-on experience operating Kubernetes clusters at scale.
Familiarity with modern Data Lakehouse architecture.
Prior experience with capacity planning and benchmarking at scale.
Hiring Process
Phone Screening > Onsite Interviews > Offer.
Employment Type
Full-time
Salary
Negotiable