SwarmBench Task Engineer — SWE at Turing

posted 2 hours ago
turing.com | Contractor | Remote | TBD

SwarmBench Task Engineer (SWE/Code) | Contractor | Remote | 4-Week Engagement

Turing is seeking an experienced SwarmBench Task Engineer to design and build high-quality multi-agent benchmark tasks rooted in real-world software engineering workflows. This is a short-term contractor role ideal for senior engineers passionate about AI evaluation and frontier model development.

About Turing

Turing is one of the world's fastest-growing AI companies, partnering with leading AI labs to advance frontier model capabilities in reasoning, coding, agentic behavior, and more. We build real-world AI systems that solve mission-critical challenges for global enterprises.

Role Overview

In this role, you will create benchmark tasks grounded in real open-source code changes — including bug fixes, migrations, and refactors — used to evaluate how effectively AI agents navigate large codebases, apply precise modifications, and produce correct, testable outputs. You'll work within the Harbor evaluation framework and collaborate with global teams at the frontier of LLM evaluation.

Day-to-Day Responsibilities

  • Build multi-agent benchmark tasks based on real-world open-source code changes (bug fixes, migrations, refactors)
  • Work within the Harbor evaluation framework to run and validate tasks inside Docker environments
  • Write clear, precise task instructions specifying file paths, function signatures, expected behavior, and constraints
  • Design and implement Python-based verification scripts to validate correctness of agent-generated code changes
  • Create decomposition strategies that split complex code changes across multiple independent sub-agents
  • Debug and refine tasks within containerized environments to ensure reproducibility and determinism
  • Evaluate task performance signals and continuously improve task quality, clarity, and difficulty

Requirements

  • 5+ years of experience in Python and JavaScript development
  • Experience with AI coding benchmarks (e.g., SWE-bench, Terminal-Bench)
  • Strong ability to read and navigate large open-source codebases (e.g., Django, Flask, FastAPI, Node.js)
  • Solid familiarity with Git workflows — pull requests, diffs, cherry-picking, and commit-level navigation
  • Comfortable writing Dockerfiles, building images, and debugging container issues
  • Experience writing test scripts using pytest, unittest, or custom assertion-based frameworks
  • Ability to produce clear, precise, and unambiguous technical specifications

Engagement Details

  • Commitment: 8 hours/day, including a 4-hour overlap with PST working hours
  • Type: Contractor (no medical/paid leave benefits)
  • Duration: 4 weeks, starting next week
