Agentic Coding Annotator (Offline Tasks) | Contractor | Remote | Turing
Turing is seeking experienced software practitioners to evaluate and improve datasets for agentic coding models. This is a high-precision, technically demanding role — not a basic annotation job. You'll work within realistic coding environments, review model trajectories, verify solutions, and produce high-quality annotations that directly influence frontier AI development.
About Turing
Turing is one of the world's fastest-growing AI companies, partnering with leading AI labs to advance frontier model capabilities in coding, reasoning, agentic behavior, and more. We build real-world AI systems that solve mission-critical challenges for companies worldwide.
Role Overview
This role focuses on offline evaluation tasks, which include:
- Designing realistic, multi-step coding tasks
- Calibrating tasks through user simulation
- Writing task-specific rubrics and binary evaluation criteria
- Grading and ranking model-generated trajectories
Day-to-Day Responsibilities
- Execute realistic coding tasks within an agentic coding harness while maintaining model blindness and session independence
- Verify model outputs by reading code, running commands, checking logs, and inspecting generated artifacts
- Perform targeted validation using tests, scripts, and manual checks
- Write clear, evidence-based rationales for trajectory rankings and assessments
- Design multi-step coding tasks including user intent and milestone structure
- Create and refine task-specific rubrics and evaluation criteria
- Review completed work for quality, completeness, consistency, and schema compliance
- Identify and escalate broken environments or process gaps with supporting evidence
Requirements
Software Engineering Fluency (Mandatory)
- 5+ years of experience in software engineering, QA, developer tooling, data/ML engineering, or similar code-heavy roles
- Strong hands-on experience in at least one (ideally two) programming languages such as Python, JavaScript/TypeScript, Rust, Java, C/C++, Bash, Haskell, Swift, or SQL
- Ability to read unfamiliar codebases, debug issues, run tests, and evaluate functional correctness
Terminal & Tooling Skills (Mandatory)
- Comfortable working in Linux/Ubuntu-like environments
- Proficient with terminal workflows, Git, code editors, package managers, test runners, JSON, YAML, and Markdown
- Familiarity with Docker and reproducible environments is a strong plus
Coding-Agent Workflow Familiarity (Mandatory)
- Experience working with agentic coding tools such as OpenCode, Claude Code, Cursor, or similar platforms
Quality Judgment & Annotation Accuracy (Mandatory)
- Compare model trajectories and identify meaningful differences
- Distinguish correctness from style, communication quality, and agent behavior
- Evaluate solutions consistently using defined rubrics
- Write concise, evidence-based rationales — not generic summaries
Preferred Qualifications (Offline / Senior Candidates)
- Strong Docker skills and experience building/debugging reproducible environments
- Experience in large, complex repositories beyond greenfield or tutorial-level projects
- Demonstrated originality and sound engineering judgment in defining technical problems
- Ability to design realistic, non-trivial tasks that go beyond simple bug fixes or README flows
Contract Details
- Commitment: 8 hours/day, including a 4-hour overlap with PST (Pacific Time)
- Employment Type: Contractor (no medical benefits or paid leave included)
- Duration: 4 weeks, starting next week
Why Work With Turing?
- Contribute to cutting-edge AI projects with leading foundation model companies
- Work at the frontier of LLM evaluation and reasoning
- Fully remote, flexible work alongside global teams
- Competitive compensation based on experience and project scope