Sr. Program Manager, Incident Management

Work from home Full-time role Hiring

AI at Zapier

At Zapier, we build and use automation every day to make work more efficient, creative, and human. So if you’re using AI tools while applying here - that’s great! We just ask that you use them responsibly and transparently.

Check out our guidance on How to Collaborate with AI During Zapier’s Hiring Process, including how to use AI tools like ChatGPT, Claude, Gemini, or others during our hiring process - and when not to.

Job Posted: April 9th, 2026

Location: Americas - North, Central and South America

As Zapier expands into the enterprise market, operational rigor matters more than ever. The Sr. Program Manager will own the end-to-end incident management program for Zapier's Product and Engineering organization: response, post-incident learning and actions, and everything in between. You'll report to the Director of Engineering for Internal Platforms & Infrastructure and be the DRI for the program's design, execution, and outcomes. You build the program and leverage AI to scale its impact.

We need someone with deep incident management expertise who's comfortable navigating ambiguity and stretching across engineering, support, security, and GTM. You have a thesis on where AI-enabled incident management is going and you'll lead us there. Zapier's product surface is expanding rapidly and with it, the complexity and stakes of incident management. This role grows with that complexity.

About You

You have deep incident management experience and you've moved beyond just executing it. You've built and led incident response programs, post-incident processes, SRE practices, or reliability-focused work. You know incident management deeply enough to rethink it, not just replicate it. You've ideally done 0-to-1 work in this space: stood up programs, defined standards, trained responders.
You re-engineer how work happens based on where AI is headed. You've created repeatable systems (workflows, agents, copilots, or automation) that fundamentally changed how work gets done. You use AI-native tools (Cursor, Claude Code, or similar) as your default, and orchestrate them into durable capabilities that compound over time. You have a forward-looking thesis on how AI will reshape your domain and you've already acted on it: stopping legacy work, redesigning processes around what AI makes possible, and redefining what the role itself looks like. You can quantify the impact on velocity, quality, or organizational capacity. You iterate, refine, and critically evaluate AI outputs, embedding quality standards and accountability into the systems you build, not just the outputs.
You're a builder, not a specialist. You have deep expertise in incident management, but you're not rigidly attached to how you've done it before. You can stretch into adjacent areas (reliability strategy, enterprise readiness, operational tooling) as the role evolves. A year from now, parts of this role may look very different, and you'll be the one driving that change. You build durable systems that work without you: processes that continue when you're on PTO or move to other work. You're energized by creating, not just maintaining.
You bring an upstream, systems mindset. You instinctively look for root causes and design solutions that scale beyond your immediate program. You understand how the full incident lifecycle (prevention, detection, response, learning) supports customer trust and enterprise readiness.
You influence without authority. You shape outcomes by building trust. You know how to build coalitions across engineering, support, security, GTM, and leadership. You lead change rather than just implementing it, and you make it stick. You anticipate resistance, adapt your approach, and help others adopt new ways of working.
You have technical empathy. You can go toe-to-toe with engineers, support leads, and product leaders to clarify the "why" behind technical tradeoffs and incident decisions. You understand the role of observability (logs, metrics, traces), SLOs, and thresholds in incident response and prevention, even if you're not the one implementing them.
You bias for velocity and clarity. You act decisively even in high ambiguity. When priorities collide, you clarify, decide, and help the org move forward. You communicate with relentless clarity: you share context and intent early, often, and candidly, especially when it's uncomfortable.
You're analytical and hands-on with data. You can work directly with data tools (e.g., Databricks, SQL) to build rich reporting and meaningful insights. You understand incident tooling (incident.io or similar) and how it integrates with Slack, PagerDuty, and on-call workflows.
You work well remotely. Zapier is 100% remote. You communicate proactively, write clearly, and know when async works and when to jump on a call.

Things You'll Do

Own the incident program. Lead the design, evolution, and governance of incident processes across the Build organization, covering both response and post-incident processes. Ensure workflows are consistent, auditable, and aligned with enterprise expectations. You are the DRI for incident management as a program.
Build AI-powered incident systems. Design and ship repeatable AI tools: automated incident summarization, intelligent severity classification, AI-assisted root cause analysis, postmortem draft generation, and more. Turn one-off AI experiments into durable workflows that compound over time.
Accelerate decisions. Create clarity in ambiguity, align stakeholders, and drive decisions across teams and zones. Serve as the point of contact for questions related to incident process, expectations, and best practices.
Surface and resolve systemic issues. Identify recurring org friction, drive root-cause solutions, and implement fixes that persist beyond individual incidents.
Build and maintain reporting. Build, maintain, and refine dashboards and reports using Databricks, Looker, and related tools. Translate data into actionable insight: identify trends, risks, weak signals, and hotspots. Communicate findings to the right audiences.
Raise the bar. Instill rigor and accountability. Coach responders and incident roles (Incident Commander, Support Leads, and new roles as they emerge). Produce and maintain clear documentation (playbooks, templates, guides) and deliver training for all incident roles and stakeholder groups.
Partner cross-functionally. Collaborate with engineering leads, EMs, product, support, security, GTM, and leadership to strengthen practices. Share clear insights, align expectations, and help teams act on opportunities for improvement. Your day-to-day counterparts are senior engineering leaders and engineering line managers.
Step in when needed. Step into incident response roles during business hours as appropriate to experience the work firsthand and inform program improvements. Facilitate retrospectives and go through the process for select incidents to help inspect and up-level the process.

Our Stack & Tools

Incident tooling: incident.io, PagerDuty, Slack, Zendesk
Data & Reporting: Databricks, Grafana, Looker
Observability context: Datadog, Grafana, Prometheus, OpenSearch
Infra context: AWS, Kubernetes, Terraform (with SRE/Platform partners)
Collaboration: GitLab, Coda, Google Workspace

What Success Looks Like

The incident program is dependable and normalized. It's part of Zapier's operating rhythm. You own program direction and ensure day-to-day execution aligns with enterprise expectations across the full incident lifecycle.
Internal teams feel supported. Processes, communication, and tools reduce friction and meet the needs of engineering, support, and GTM partners. Stakeholder feedback is incorporated pragmatically.
Workflows run consistently with low friction. They're easy to follow, easy to learn, and allow people to focus their energy where it counts.
Systemic improvements persist. You elevate technical and program management rigor beyond individual incidents. The systems you build continue to work when you're not there.
Data quality is rich and trusted. Reports and insights help leadership understand trends, systemic risks, and improvement opportunities.
Outcomes improve measurably. Incident frequency drops, time-to-resolution gets faster, stakeholder confidence rises, and operational maturity increases across engineering.
You're a force multiplier. The org has fewer blockers and more velocity than when you found it.

Application Deadline:

The anticipated application window is 30 days from the date the job is posted, unless the number of applicants requires it to close sooner or later, or the position is filled.

Even though we’re an all-remote company, we still need to be thoughtful about where we have Zapiens working. Check out this resource for a list of countries where we currently cannot have Zapiens permanently working.
