[Remote] Senior Cloud Operations Engineer
Note: This is a remote position open to candidates in the USA. The Linux Foundation is a driving force in fostering open source collaboration and supporting communities across a range of projects, including PyTorch. They are seeking a Senior Cloud Operations Engineer who will focus on the infrastructure operations of the PyTorch project, automating processes, optimizing cloud-native tools, and ensuring a robust and scalable cloud environment.
Responsibilities
- Manage multi-cloud environments, primarily focusing on AWS services (EKS, EC2, S3, IAM, ELB)
- Contribute to architectural exercises with the open source community and technical leads to validate new cloud infrastructure
- Implement and maintain infrastructure-as-code using Terraform via pytorch/ci-infra and pytorch/test-infra
- Optimize cloud resource utilization and implement FinOps practices for cost management and reporting
- Design, implement, and maintain CI/CD pipelines using GitHub Actions and Actions Runner Controller (ARC), including runner configurations and other elements of the CI ecosystem
- Debug and triage issues in build and test pipelines, drawing on experience with unit testing
- Develop monitoring and alerting solutions for CI/CD workflows and critical infrastructure
- Manage and optimize Cloudflare CDN deployments for PyTorch assets (R2/S3)
- Implement best practices for CDN and overall infrastructure security
- Develop comprehensive monitoring and observability solutions using Datadog, AWS CloudWatch, and other telemetry data collection and processing tools
- Review and recommend monitoring solutions as project and community needs evolve
- Participate in on-call rotations supporting operations and incident response using incident.io
- Establish and maintain escalation procedures and resolution processes
- Participate in ci-infra and multi-cloud working groups and support architecture decisions
- Collaborate with external contributors and promote DevOps best practices
- Manage GitHub repositories, including user onboarding and access control
- Attend and contribute to technical meetings, including Infrastructure, CI Workflow, and Technical Advisory Council sessions
- Develop and maintain technical documentation for infrastructure and processes
- Provide guidance on developer best practices and tooling
- Create and update runbooks for common operational tasks and incident response
Skills
- Ability to work with communities made up of industry specialists and collaborate outside of the Linux Foundation
- Bachelor's degree in Computer Science, Engineering, or related field
- 7+ years of experience in cloud operations with significant AWS expertise
- Strong knowledge of infrastructure-as-code principles and tools, particularly Terraform
- Proficiency in scripting languages (Python, TypeScript, Bash) and containerization technologies (Docker, Kubernetes)
- Experience with Cloudflare CDN management and optimization
- Expertise in implementing and managing monitoring solutions, specifically Datadog and AWS CloudWatch
- Familiarity with incident management tools and processes, particularly incident.io
- Demonstrated experience in CI/CD pipeline design and implementation
- Strong problem-solving skills and ability to troubleshoot complex systems
- Excellent communication skills and experience collaborating with open source communities
- Experience with PyTorch or other open source communities
- Multi-cloud expertise across AWS, GCP, and Azure
- GitHub ARC experience
- Knowledge of FinOps principles and cloud cost optimization strategies
- Contributions to open source projects, especially in infrastructure management roles
- Familiarity with the Linux Foundation or similar open source foundations
- Experience mentoring other engineers and fostering a collaborative team environment
Benefits
- The Linux Foundation maintains a predominantly remote workforce
- Commitment to hiring top-notch talent
- Flexible and supportive work culture
- Collaboration is embedded in our DNA
- Work closely with colleagues without being confined to a traditional office space
Company Overview