Optimizely is focused on unlocking digital potential. We are the recognized category leader in Digital Experience Platform (DXP) and created the category for A/B Testing and experimentation software. We have incredible customers, and isn’t that one of the most important aspects of looking for your next job? Optimizely serves over 9,000 brands, from global organizations such as Visa, Sky, Yamaha, and the Wall Street Journal to tech innovators like Atlassian, DocuSign, Fitbit, and Zillow.

Not only are we financially sound and growing, but we also have unicorn status: we exceeded $300M in revenue in 2020, are already profitable, and have all strategic options ahead of us. Optimizely continues to invest and addresses a market opportunity north of $30 billion, providing significant personal career growth opportunities.

We have an inclusive culture and a global team of 1500+ people across the US, Europe, Australia, and Vietnam. We blend European and American business culture with an emphasis on teamwork, inclusion, and moving fast. People make the difference! If you are looking to work on the next generation of digital technologies in a fast-paced, hyper-growth environment, apply! We’re just getting started...

We are looking for a Senior Site Reliability Engineer to help build and scale our CloudOps capabilities. You will be responsible for designing, implementing, and operating critical infrastructure and platform services while collaborating closely with engineering, support, and product teams to improve the reliability, scalability, and performance of our systems. This is a hands-on technical role where you will be instrumental in shaping the SRE culture, driving automation, and ensuring high availability across all services.

Responsibilities:
Champion a Site Reliability Engineering culture across the organization by sharing best practices, tools, documentation, and code.
Identify and automate manual operational tasks using scripting, infrastructure-as-code, and CI/CD pipelines.
Build and maintain observability (monitoring, logging, tracing) for all production systems to ensure reliability, availability, and performance.
Proactively monitor alerts across all platforms and coordinate with SRE, Operations, Engineering, and Support teams to ensure quick detection and resolution of incidents, minimizing MTTA/MTTR.
Lead and manage on-call rotations, driving a blameless incident management and postmortem culture.
Collaborate with development teams to define and implement SLOs, SLIs, and error budgets.
Ensure uptime SLAs are met through robust automation, testing, monitoring, and operational best practices.
Create and maintain runbooks, playbooks, and system documentation to ensure operational readiness and knowledge sharing.