[Remote] Manager, Site Reliability Engineering

Work from home Full-time role Hiring

Note: The job is a remote job and is open to candidates in USA. Aya Healthcare is a rapidly growing workforce solutions provider in the healthcare industry, recognized for its exceptional workplace culture. The Manager of Site Reliability Engineering will lead a team focused on enhancing product and platform reliability, ensuring an exceptional experience for users while driving the adoption of AI-native operations and automation.

Responsibilities

Lead and grow the SRE team
Lead, mentor, and grow a team of high-performing Site Reliability Engineers across hiring, performance management, career development, and on-call rotation health
Set the operating cadence for the team — standups, incident reviews, SLO/error-budget reviews, post-incident learning, and capacity planning
Build a culture of blameless learning, technical depth, customer empathy, and disciplined ownership
Partner closely with DevSecOps, Security Engineering, DRE, Incident & Change Management, and product engineering leadership to remove cross-team friction
Own the reliability strategy for customer-facing products and internal platforms — defining SLOs, SLIs, and error budgets in partnership with product and engineering leadership, and operationalizing them in the release process
Lead major incident response as senior incident commander for severity-1 events; institutionalize blameless post-incident reviews and ensure systemic fixes ship
Champion proactive reliability — chaos engineering, game days, failure-mode analysis, capacity and load testing — well before incidents force the conversation
Manage software release support and 24/7 on-call escalation rotations across the platform surface area, with humane on-call load and clear escalation paths
Build the AIOps practice — anomaly detection, predictive alerting, intelligent correlation, and automated triage — to drive measurable reductions in MTTD and MTTR
Operationalize AI-assisted workflows for incident summarization, runbook generation, log and trace analysis, change risk scoring, and post-incident narrative drafting
Pilot and scale agentic remediation where appropriate, with strict guardrails, audit trails, and human-in-the-loop controls suitable for a HIPAA-regulated environment
Evolve the observability platform (Datadog metrics, logs, traces, RUM, synthetics, CI Visibility) so engineering teams can operate their own services with confidence and clear ownership
Treat reliability as a product with a roadmap, measurable outcomes, and an executive-credible narrative — not as overhead
Drive platform unit economics by partnering with FinOps and platform leadership on cost-to-serve, right-sizing, capacity efficiency, and waste elimination
Communicate outcomes to executive, product, and customer-facing stakeholders in plain language tied to clinician and client experience
Uphold HIPAA, PHI, and security obligations across every reliability decision, change, and tool selection

Skills

10+ years in a combination of Site Reliability Engineering, DevOps, Platform Engineering, or related production-operations roles
4+ years of direct people management experience — hiring, performance management, career development, and running remote on-call teams
Demonstrated ownership of reliability outcomes for customer-facing SaaS at meaningful scale — defining and operationalizing SLOs/SLIs/error budgets and using them to drive engineering prioritization
Deep Azure experience — 3+ years operating production workloads on Azure, with hands-on depth in AKS, networking, identity, and platform services. Equivalent depth in AWS or GCP will be considered
Modern observability fluency — production-grade experience with Datadog (or equivalent: New Relic, Dynatrace, AppDynamics) across metrics, logs, traces, RUM, and synthetics
AI in operations — hands-on experience integrating AI/LLM-assisted tooling into operational workflows (incident summarization, runbook generation, log analysis, anomaly triage, change risk scoring)
Incident command experience — proven ability to lead severity-1 incidents end-to-end, run blameless reviews, and convert lessons into systemic improvements
Regulated-environment instinct — operates with HIPAA, PHI, SOC 2, or comparable compliance constraints as a default mindset, not an afterthought
Executive-grade communication — translates reliability work into business outcomes for executive, product, and customer-facing audiences
Bachelor's degree in Computer Science, Information Technology, Engineering, or related field — or an equivalent combination of education, training, and experience
Cloudflare at the edge — production experience with Cloudflare CDN, WAF, Workers, Access (ZTNA), Tunnel, Turnstile, and certificate management
IaC at scale — Terragrunt and Terraform in a multi-environment, policy-gated pipeline; experience evolving IaC from 'it works' to 'it scales safely.'
CI/CD maturity — GitHub Actions with OIDC/workload identity federation, OPA/Conftest policy-as-code, progressive delivery, and DORA-metric instrumentation
Container platform depth — Kubernetes/AKS in production, including Helm, ingress, service mesh, autoscaling, and node lifecycle
ITSM integration — ServiceNow for change, incident, and problem management; experience tying observability and CI data into ITSM workflows
Identity ecosystem — operating in an Okta / Entra ID / M365 identity environment, including PIM, conditional access, and service-principal hygiene
Chaos and resilience engineering — running game days, fault injection, and resilience exercises as a routine practice
FinOps fluency — cost-to-serve, right-sizing, capacity efficiency, and unit-economics work in cloud environments
Agile delivery — Scrum/Kanban delivery with Jira; comfortable operating in a quarterly planning + continuous-delivery cadence

Benefits

Free premium medical, dental, life and vision insurance
Generous 401(k) match
Aya also offers other benefits to those that are eligible and where required by applicable law, including reimbursements and discretionary bonuses
Aya provides paid sick leave in accordance with all applicable state, federal, and local laws. Aya’s general sick leave policy is that employees accrue one hour of paid sick leave for every 30 hours worked. However, to the extent any provisions of the statement above conflict with any applicable paid sick leave laws, the applicable paid sick leave laws are controlling
Celebrations! We hit our goals and reward ourselves.
Company-sponsored virtual events, happy hours and team-building activities are always on the horizon — plus, you get a special treat on your birthday!
Unlimited DTO — we believe in time off!
Virtual yoga, meditation or boot camp classes offered daily

Company Overview

Aya Healthcare is a provider of workforce optimization solutions for healthcare. It was founded in 2001, and is headquartered in San Diego, California, USA, with a workforce of 5001-10000 employees. Its website is https://www.ayahealthcare.com.

Company H1B Sponsorship

Aya Healthcare has a track record of offering H1B sponsorships, with 1 in 2026, 3 in 2025, 1 in 2024, 2 in 2023, 6 in 2022, 1 in 2021, 2 in 2020. Please note that this does not guarantee sponsorship for this specific role.

Apply To This Job

Apply

[Remote] Manager, Site Reliability Engineering

You might like

[Remote] Principal Product Designer

[Remote] Backend Development Engineer Intern

[Remote] Senior Software Engineer, Backend

[Remote] Senior People Operations Business Partner

[Remote] Quantitative Research Consultant (Short-term)

[Remote] Full Stack Java Developer

[Remote] Senior System Software Engineer - DevOps and Infrastructure Automation

[Remote] Senior Backend Engineer - Anywhere Cloud

[Remote] Software Engineer II, AI

[Remote] Staff AI Platform Engineer

Director of Strategic Partnerships

HR Customer Support Specialist – Remote Contract Role Focused on Open Enrollment, Benefits Administration, and EDI Coordination

GIS Mapping - QGIS & Spatial Data | Remote

State & Federal Compliance Analyst

Global Accounts Payable Manager

AI Security / GenAI Engineer

High Net Worth Tax Managing Director

Senior Director, Talent Development

Experienced Online Chat Support Specialist – Deliver Exceptional Customer Experiences at arenaflex

Strategy & Operations Lead Consultant - Remote - CST/EST