AI Evaluation Engineer (Python) at Turing

turing.com | Contractor | Remote | TBD

AI Evaluation Engineer (Python) | Contractor | 40 hrs/week | Worldwide Remote

Turing is one of the world's fastest-growing AI companies, accelerating the advancement and deployment of powerful AI systems. We are seeking experienced Python developers to join our AI Evaluation team as AI Evaluation Specialists. In this role, you will design and author evaluation tasks that benchmark the capabilities of advanced AI systems on real-world software engineering challenges, directly influencing the next generation of frontier AI models.

What You'll Do

  • Design realistic, Python-focused software engineering evaluation tasks for AI agents
  • Write clear, precise instructions defining expected outputs, constraints, and success criteria
  • Create reference solutions that successfully solve evaluation tasks and satisfy validation requirements
  • Develop human-readable verifier descriptions documenting expected behaviors and evaluation checks
  • Author domain-specific knowledge files that guide AI systems on relevant workflows and best practices
  • Review tasks for clarity, consistency, edge cases, and overall evaluation quality
  • Analyze AI-generated outputs and identify common failure patterns
  • Collaborate with researchers and evaluation teams to improve benchmark quality and coverage

Requirements

  • Bachelor's degree or higher in Computer Science, Software Engineering, IT, or a related field
  • 3–5 years of hands-on Python development experience
  • Strong proficiency in Python and core software engineering concepts
  • Experience building applications, automation scripts, APIs, data-processing workflows, or backend systems in Python
  • Solid understanding of data structures, algorithms, debugging, testing, and software development best practices
  • Excellent written English with the ability to produce clear, unambiguous technical documentation
  • Comfortable working with structured formats: JSON, Markdown, YAML, DOCX, and XLSX

Nice to Have

  • Experience with LLM evaluation, prompt engineering, or AI benchmarking
  • Experience creating technical assessments, coding challenges, or educational content
  • Familiarity with Docker, containers, or cloud-based development environments

Engagement Details

  • Commitment: 40 hours/week, with a 4-hour overlap with PST
  • Type: Contractor/Freelancer (no medical or paid leave benefits)
  • Duration: 2-month contract, starting next week

Perks

  • Work on cutting-edge AI projects with leading research organizations
  • Flexible, fully remote work environment
  • Opportunity to shape the evaluation of next-generation AI systems
  • Collaborate with a global network of highly skilled professionals
