This job post expired on July 04, 2026. The position has likely already been filled.

AI Evaluation Engineer at Turing

posted 1 month ago

turing.com | Contractor | Remote | TBD

AI Evaluation Engineer (Python / Java / Web) | Contractor | 40 hrs/week | Worldwide Remote

Turing is seeking experienced Software Engineers to join its AI Evaluation team as AI Benchmark Authors. In this role, you will design and author high-quality evaluation tasks that measure how well advanced AI agents perform in real-world software development scenarios. This work directly influences the benchmarking of frontier AI models used by leading AI research organizations.

What You'll Do

Design realistic software engineering evaluation tasks for AI agents
Write clear, unambiguous instructions defining expected outputs, constraints, and success criteria
Create reference solutions that successfully solve authored tasks
Develop verification criteria and automated test descriptions for task validation
Author domain-specific skill files covering workflows, conventions, and best practices
Ensure consistency across benchmark variants while maintaining rigorous evaluation standards
Review tasks for quality, edge cases, and failure modes to improve benchmark reliability
Collaborate with AI researchers, evaluators, and engineering teams to refine benchmark quality

Requirements

Bachelor's degree or higher in Computer Science, Software Engineering, or a related field
5+ years of hands-on software development experience
Strong expertise in at least one domain: Python, Java/JVM, or Web Application Development (Frontend, Backend, or Full Stack)
Excellent written English with the ability to craft precise technical instructions
Solid understanding of software engineering workflows, debugging, testing, and code quality practices
Experience with structured file formats such as JSON, Markdown, YAML, DOCX, or XLSX

Nice to Have

Experience with LLM evaluation, prompt engineering, or AI benchmarking
Background in creating technical assessments, coding challenges, or educational content
Familiarity with Docker, containers, or cloud-based development environments

Engagement Details

Commitment: 40 hours/week, including a 4-hour overlap with PST
Type: Contractor/Freelancer (no medical or paid leave benefits)
Duration: 2-month contract, starting next week

Why Work With Turing?

Contribute to cutting-edge AI projects with top AI research organizations
Flexible, fully remote work environment
Influence the evaluation of next-generation AI systems
Collaborate with a global network of highly skilled professionals


