[Remote] AI Systems Administrator
Note: This is a remote position open to candidates in the USA. MCI is one of the fastest-growing tech-enabled business services companies in the USA, specializing in customer experience and business process outsourcing. The company is seeking a technically skilled AI Systems Administrator to support, maintain, and optimize the infrastructure for its artificial intelligence and machine learning environments, ensuring the reliability, scalability, and security of AI systems.
Responsibilities
- Oversee, configure, and monitor AI and ML systems, servers, and cloud environments to ensure optimal performance and uptime
- Manage GPU/CPU clusters and ensure efficient resource allocation for training and inference workloads
- Implement and maintain scalable infrastructure to support large language models (LLMs), data processing pipelines, and model deployment
- Optimize system performance through tuning, automation, and proactive maintenance
- Apply best practices for securing AI systems, ensuring data integrity, confidentiality, and compliance with company and industry standards
- Manage user access, permissions, and security configurations across AI platforms
- Support the deployment and integration of AI models and APIs into production environments
- Collaborate with developers, data scientists, and prompt engineers to ensure seamless system functionality and workflow automation
- Monitor system health, usage, and performance metrics; diagnose and resolve infrastructure or software issues
- Maintain logs, conduct root cause analysis, and implement corrective actions to prevent recurrence
- Develop scripts and tools to automate system tasks, data transfers, and performance checks
- Support CI/CD pipelines for AI model updates and system maintenance
- Create and maintain detailed documentation of system configurations, procedures, and troubleshooting guides
- Provide technical support to AI teams, ensuring smooth operation of all AI systems and tools
- Stay up to date with advancements in AI infrastructure, cloud technologies, and MLOps practices
- Recommend and implement improvements to enhance system reliability and scalability
Skills
- Bachelor's degree in Computer Science, Information Technology, Data Engineering, or a related field
- 2+ years of experience in systems administration, DevOps, or infrastructure management (AI/ML environment experience preferred)
- Strong understanding of cloud platforms (AWS, Azure, GCP) and containerization technologies (Docker, Kubernetes)
- Experience with Linux/Unix administration, Python/Bash scripting, and automation tools (Terraform, Ansible, Jenkins)
- Familiarity with machine learning frameworks (TensorFlow, PyTorch) and AI model deployment pipelines
- Understanding of networking, security, and storage in distributed computing environments
- Experience with GPU-based computing and performance optimization for AI workloads
- Excellent problem-solving, troubleshooting, and documentation skills
- Strong collaboration and communication abilities to work with cross-functional AI and engineering teams
- Must be authorized to work in the country where the job is based
- Must be willing to submit to up to a Level II background and/or security investigation, including fingerprinting
- Must be willing to submit to drug screening
Company Overview