[Remote] Software Engineer – AI Coding Evaluation
Note: This is a remote position open to candidates in the USA. MillionLogics is a global leader in IT solutions, specializing in Data & AI, Cloud Solutions, and IT Consulting. The company is seeking experienced Software Engineers to evaluate and improve the coding capabilities of frontier AI models by assessing AI-generated code and developing high-quality evaluation datasets and benchmarks.
Responsibilities
- Review and evaluate AI-generated code for correctness, efficiency, maintainability, and adherence to requirements
- Analyze software engineering tasks and validate whether proposed solutions meet expected outcomes
- Debug code, reproduce issues, and verify fixes across different programming environments
- Assess model-generated explanations, reasoning, and implementation approaches for technical accuracy
- Create, refine, and maintain evaluation datasets, benchmarks, and grading rubrics for coding tasks
- Identify edge cases, failure modes, and areas where AI systems struggle with software engineering problems
- Document findings clearly and provide structured feedback to improve evaluation quality and consistency
- Collaborate with project teams to establish quality standards and evaluation methodologies
Skills
- Bachelor's or Master's degree in Computer Science, Software Engineering, or a related technical field
- 3+ years of professional software engineering experience
- Strong proficiency in one or more of the following languages: Python, Java, C/C++, Go, Swift, Objective-C, PHP, or SQL
- Strong understanding of data structures, algorithms, software design principles, and debugging methodologies
- Experience performing code reviews and evaluating code quality in production or large-scale codebases
- Ability to analyze complex technical problems and assess solution correctness with minimal supervision
- Familiarity with version control systems (e.g., Git) and modern software development workflows
- Strong written communication skills and attention to detail
- Experience with AI/ML data annotation, NLP, prompt engineering, model evaluation, or LLM-related projects
- Experience evaluating AI-generated code, benchmark creation, or software quality assessment
Benefits
- Mode of Work: Remote
- Contract: 12 months
- Commitment Required: At least 4 hours per day and a minimum of 20 hours per week, with 4 hours of overlap with PST
- Engagement Type: Contractor assignment (no medical/paid leave)
Company Overview