Description
In the dynamic landscape of On, the tech thrives much like a spirited runner: always moving, always improving. We are building technology that continues to supercharge the growth of On, helping to ignite the human spirit through movement. We’re seeking a Staff Site Reliability Engineer to ensure our digital platforms deliver exceptional performance, reliability, and scalability to support our global customer base.
You will join a skilled and dynamic team of cloud & site reliability engineers dedicated to transforming On’s technological foundation. We are crafting scalable, resilient cloud solutions to power internal operations, enhance product performance, and support On’s growth.
As a Staff Site Reliability Engineer (SRE) at On, you will play a pivotal role in designing, building, and maintaining our cloud infrastructure to support our e-commerce platforms, customer-facing applications, and internal systems. You will work closely with engineering teams to drive reliability, optimise performance, and implement automation, serving as a technical expert and mentor within the team.
– System Reliability & Performance: Ensure high availability (99.99%+ uptime), scalability, and performance of On’s digital platforms through proactive optimisation and robust infrastructure design.
– Infrastructure Development: Build and maintain cloud-based infrastructure using Infrastructure-as-Code (IaC) tools.
– Automation: Develop and implement automation solutions to streamline deployments, reduce toil, and enhance monitoring.
– Incident Response: Lead incident resolution, perform root cause analyses, and implement preventive measures to minimise downtime and improve system resilience.
– Monitoring & Observability: Design and maintain monitoring, logging, and alerting systems to ensure proactive issue detection and resolution.
– Collaboration: Partner with software engineering, product, and security teams to align infrastructure with business objectives and ensure secure, scalable systems.
– Capacity Planning: Analyse and forecast infrastructure needs to support On’s growth, balancing performance and cost efficiency.
– Mentorship: Provide technical guidance and mentorship, fostering a culture of continuous learning and improvement.
– Compliance & Security: Ensure systems meet industry standards for data privacy and security.
As a key member of our team, you will shape our cloud infrastructure strategy, ensuring robust, efficient, and sustainable systems that drive innovation. Join us in Berlin to make a lasting impact on On’s digital future!
– Extensive experience in site reliability engineering, with a track record of managing complex, high-traffic systems.
– Strong expertise in cloud platforms (GCP) and container orchestration (Kubernetes, GKE).
– Proficiency in scripting and programming (e.g., Python, Go) for automation and tooling.
– Experience with CI/CD pipelines (ArgoCD, GitHub Actions) and IaC (Terraform).
– Solid understanding of networking, load balancing, and DNS management.
– Experience with observability and monitoring for cloud-native environments.
– Strong analytical skills with a proactive approach to resolving complex technical challenges.
– Excellent communication skills, with the ability to explain technical concepts to diverse stakeholders.
Nice to Have:
– Experience with e-commerce platforms or high-traffic consumer applications.
– Background in performance engineering, including load testing and capacity optimisation for peak traffic events (e.g., product launches, Black Friday).
– Experience optimising global content delivery networks (CDNs) for low-latency, high-performance user experiences (e.g., Cloudflare).
– Bachelor’s degree in Computer Science, Engineering, or a related field (or equivalent experience).