Our company is an innovator at the nexus of fitness, technology, and media. We are the largest interactive fitness platform in the world and recently went public. We have reinvented the fitness industry by developing a first-of-its-kind subscription platform that seamlessly combines the best equipment, proprietary networked software, and world-class streaming digital fitness and wellness content, creating a product that our Members love.
We are looking for a Site Reliability Engineer with a focus on Kubernetes operations to work with teams across the organization to help build and maintain a monitorable, performant, reliable, and highly scalable deployment platform. We are a growing team of engineers tackling challenging problems with scaling Kubernetes to handle thousands of nodes and pods spread across many deployments.
The Kubernetes working group works closely with development teams to ensure that the platform is robust and stable and that it delivers features that include the following:
- Automatic, fast autoscaling for live rides and special large events
- Hosting for critical infrastructure, running on tens of thousands of pods across multiple clusters, that ensures our members have the best possible experience
- A platform for machine learning (and other awesome workloads) so that we can be at the forefront of the industry
What You'll Be Doing:
- Evangelize best practices for building and operating highly reliable systems
- Serve as a subject matter expert in observability and monitoring
- Consult in system design to meet reliability and capacity requirements
- Automate everything, from infrastructure down to day-to-day tasks
- Conduct timely post-mortems of infrastructure incidents
- Assist with all aspects of operational security and compliance
- Seek out potential threats to security and reliability and advocate solutions
- We work with Amazon Web Services, Chef, Python, Ubuntu, Nginx, Jenkins, Terraform, Docker, and Kubernetes
What We’re Looking For:
- Experience maintaining scalable and stable Kubernetes clusters.
- Knowledge of best practices for the observability and monitoring required to run Kubernetes at scale.
- Knowledge of best practices for securing a Kubernetes cluster and its deployments at scale.
- A passion for helping development teams make the transition to a container-native world.
- Experience with CI/CD systems such as Jenkins, ArgoCD, Harness, Tekton, etc.
- Experience deploying infrastructure using Infrastructure as Code tools such as Terraform.
- Ability to judge when to triage and when to dive into a root-cause analysis.
- Passion for reliable, scalable, observable software, with a strong sense of ownership.
- Experience with a programming language such as Python, Golang, or Java.