Senior Site Reliability Engineer — Infrastructure & Architecture

Work from home | Full-time role | Hiring

VAEIT’s Infrastructure & Architecture team is the engineering backbone of a 40-person organization delivering network engineering consultancy and network management solutions for Department of Defense customers.

We own the entire infrastructure surface area — AWS accounts, CI/CD pipelines, identity management, security compliance, and production deployments. Demand from the engineering organization is rapidly outpacing current capacity.

This is a force-multiplier hire. The right person will reduce operational risk, accelerate delivery across all engineering teams, and help us transition from reactive support to proactive platform engineering.

What You’ll Own:

AWS Infrastructure & Production Operations

  • Operate and evolve a multi-account AWS environment (5 accounts) including ECS Fargate, RDS, Lambda, CloudFront, and multi-VPC architectures
  • Manage production ECS services across multiple accounts and clusters
  • Define and maintain all infrastructure as code using Terraform — no manual configuration
  • Design and manage networking, IAM, security groups, and cross-account access patterns

CI/CD & Developer Experience

  • Own and improve GitHub Actions pipelines used across the engineering organization
  • Build and maintain workflows for multiple teams and tech stacks (Go, C#, Python)
  • Reduce build times and increase deployment reliability to accelerate delivery

Identity, Access & IT Systems

  • Administer Microsoft Entra ID (Azure AD) for identity and SSO
  • Manage user provisioning, groups, and access policies
  • Own GitHub Enterprise configuration, permissions, and security controls

Security & Compliance

  • Strengthen software supply chain security, including SBOM generation
  • Support and improve FedRAMP compliance posture
  • Enforce least-privilege IAM and conduct security audits across environments

Observability & Incident Response

  • Operate and evolve monitoring systems (CloudWatch, Prometheus, Alertmanager)
  • Improve signal-to-noise ratio in alerting and detection

Physical Infrastructure & MLOps

  • Manage on-premises GPU servers for ML training and inference
  • Bridge cloud and on-prem infrastructure, enabling scalable ML workloads in AWS
  • Support data science teams with reproducible environments and deployment pipelines
  • Maintain GPU tooling (NVIDIA drivers, CUDA) and containerized workloads

Automation & Tooling

  • Build internal tools (Go, Python, Bash, PowerShell) to eliminate manual work
  • Extend Ansible-based configuration management
  • Treat operations as software — automate everything possible

Platform Standards & Engineering Excellence

  • Define standards for logging, health checks, configuration, and deployment
  • Build “golden paths” (templates, starter repos, shared workflows)
  • Champion observability practices (structured logging, tracing, SLOs)
  • Review infrastructure and deployments to catch reliability and security issues early

Enabling Agentic LLM Systems

  • Build infrastructure for LLM-powered products (GPU compute, model serving, vector databases)
  • Design deployment pipelines for model versioning, evaluation, and inference routing
  • Enable rapid experimentation with self-service GPU-backed environments
  • Ensure AI systems meet DoD security requirements (audit logging, isolation, provenance tracking)

What We’re Looking For:

  • 5+ years in SRE, DevOps, or Platform Engineering
  • Deep AWS experience (ECS/Fargate, RDS, VPCs, IAM, Lambda, CloudFront, multi-account setups)
  • Strong Terraform skills (modules, state management, code reviews)
  • Experience building and scaling CI/CD systems (GitHub Actions preferred)
  • Hands-on Linux administration (including physical or bare-metal systems)
  • Experience with GPU/ML infrastructure (CUDA, containerized workloads)
  • Proficiency in at least two of: Go, Python, Bash, PowerShell
  • Strong networking fundamentals and debugging skills
  • Experience with identity systems (Entra ID / Azure AD, SSO/SAML)
  • Excellent written communication in a remote, async environment
  • Highly self-directed and execution-oriented
  • U.S. Citizenship (required for DoD work)

Strongly Preferred:

  • Experience in DoD or FedRAMP-regulated environments
  • MLOps tooling (MLflow, Weights & Biases, pipeline orchestration)
  • Migrating GPU workloads from on-prem to AWS (EC2 GPU, SageMaker, ECS)
  • Ansible for configuration management
  • Software supply chain security (SBOMs, signing, vulnerability scanning)
  • Prometheus/Alertmanager experience
  • GitHub Enterprise administration at scale
  • Container security and ECS optimization
  • Familiarity with .NET ecosystems
  • Exposure to Rust

Who You Are:

  • You apply software engineering rigor to operational challenges
  • You instinctively automate rather than rely on manual processes
  • You thrive in high-autonomy, high-impact environments
  • You communicate clearly and proactively in async workflows
  • You’ve operated as a senior IC in a small team with broad ownership

VAE, Inc. is a full-service IT infrastructure solutions company focused on building, securing, and supporting our clients’ mission-critical enterprises. We provide design, integration, and implementation services as well as fully managed service offerings. VAE uses multi-tenant-capable technologies and shared IT services to create secure, reliable, and cost-effective end-to-end solutions. We deliver these infrastructure solutions through talented employees and a client-focused partnering approach.

VAE, Inc. is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to sex, race, color, religion, national origin, disability, protected Veteran status, age, or any other characteristic protected by law.
