[Remote] Staff Site Reliability Engineer

Work from home | Full-time role | Hiring

Note: This is a remote role open to candidates in the USA. Thrive Market is an online, membership-based market focused on making healthy and sustainable living accessible. They are seeking a Staff Site Reliability Engineer to establish their SRE practice, define reliability metrics, and ensure system scalability during rapid growth.

Responsibilities

  • Define, implement, and own Service Level Objectives (SLOs) and Service Level Indicators (SLIs) across critical platform services
  • Build and maintain comprehensive monitoring, alerting, and observability systems using tools like Datadog, Prometheus, Grafana, or similar platforms
  • Establish error budgets and use them to balance feature velocity with reliability investments
  • Lead incident response efforts, conduct blameless postmortems, and drive systemic improvements that prevent recurrence
  • Design and implement chaos engineering practices to proactively identify failure modes before they impact members
  • Architect and optimize our Kubernetes-based container orchestration platform for reliability, performance, and cost efficiency
  • Support large infrastructure migrations, ensuring a smooth transition with minimal disruption to business operations
  • Contribute to the evaluation and execution of potential platform migrations, with a focus on reliability planning and risk mitigation
  • Design and implement automated deployment pipelines that enable rapid, error-free releases with feature flags and built-in rollback/roll-forward capabilities
  • Develop and own disaster recovery plans, capacity planning models, and system hardening initiatives
  • Collaborate closely with product engineering teams to help them scale their infrastructure in AWS and adopt SRE best practices
  • Help establish SRE as a practice at Thrive Market, defining the team’s charter, processes, and engagement model with product engineering teams
  • Champion a culture of operational excellence, continuous improvement, and data-driven reliability decisions
  • Create and maintain technical documentation covering architecture decisions, runbooks, incident response procedures, and operational playbooks
  • Participate in weekly on-call rotations and help build sustainable on-call practices that avoid burnout
  • Identify systemic problems and inefficiencies across the engineering organization and make strategic recommendations for improvement

Skills

  • B.S. in Computer Science or equivalent professional experience
  • 7+ years of hands-on experience in SRE, DevOps, or Infrastructure Engineering, with a proven track record of improving reliability at rapidly growing companies
  • Deep expertise in Kubernetes (K8s) — including cluster management, Helm charts, service meshes, and production-grade container orchestration
  • Strong systems engineering background with advanced proficiency in Linux administration
  • Advanced scripting and automation skills in Bash, Python, Golang, Ruby, or similar languages
  • Extensive experience with core AWS services including EC2, ECS/EKS, S3, VPC, IAM, CloudWatch, Route 53, RDS, and Lambda
  • Strong experience with Infrastructure as Code tools (Terraform, CloudFormation, Pulumi, or similar)
  • Hands-on experience defining and implementing SLOs, SLIs, and error budgets in production environments
  • Deep understanding of CI/CD pipelines and deployment strategies (blue-green, canary, rolling deployments)
  • Expertise in monitoring and observability platforms (Datadog, Prometheus, Grafana, New Relic, or similar)
  • Strong knowledge of web application infrastructure, networking, load balancing, and security best practices
  • Excellent communication skills with the ability to lead incident response and facilitate blameless postmortems
  • Experience with e-commerce platforms (Magento, Shopify, or comparable) and the unique reliability challenges they present at scale
  • Experience with ConcourseCI, GitHub Actions (GHA), or similar deployment frameworks
  • Experience with chaos engineering tools and practices (Gremlin, Litmus, Chaos Monkey, or similar)
  • Familiarity with GitOps workflows (ArgoCD, Flux) and service mesh technologies (Istio, Linkerd)
  • Experience building and managing cost-optimization strategies for cloud infrastructure
  • Background in establishing SRE practices in organizations transitioning from traditional DevOps models
  • Experience with configuration management tools (Ansible, Chef, Puppet, or similar)

Benefits

  • Comprehensive health benefits (medical, dental, vision, life and disability)
  • Competitive salary (DOE) + equity
  • 401k plan
  • 9 Observed Holidays
  • Flexible Paid Time Off
  • Subsidized ClassPass Membership with access to fitness classes and wellness and beauty experiences
  • Ability to work in our beautiful office in Playa Vista
  • Free Thrive Market membership with exclusive employee discount
  • Coverage for Life Coaching & Therapy Sessions on our holistic mental health and well-being platform

Company Overview

  • Thrive Market is a membership-based online company that offers natural and organic food products. It was founded in 2013 and is headquartered in Los Angeles, California, USA, with a workforce of 501-1000 employees. Its website is https://thrivemarket.com.