SwarmBench Task Engineer – Research at Turing

posted 3 hours ago
turing.com | Contractor | Remote | TBD

SwarmBench Task Engineer – Knowledge/Research | Contractor | Fully Remote | ~40 hrs/week

Turing is looking for a highly analytical, research-driven engineer to design and build multi-agent benchmark tasks focused on knowledge synthesis and large-scale document analysis. This is a short-term contract role (1 month) with potential for extension, ideal for someone with a strong research background and hands-on experience in AI evaluation.

About Turing

Based in San Francisco, Turing is the world's leading research accelerator for frontier AI labs and a trusted partner for global enterprises deploying advanced AI systems. Turing accelerates frontier research with high-quality data, advanced training pipelines, and top AI researchers specializing in coding, reasoning, STEM, multilinguality, multimodality, and agents.

Role Overview

You will craft challenging, insightful benchmark problems in your research domain and devise elegant computational solutions that push the limits of multi-agent AI systems. Your work will directly shape how AI agents are evaluated on complex, real-world research tasks.

Key Responsibilities

  • Build multi-agent benchmark tasks requiring reading, analyzing, and synthesizing large document collections.
  • Curate real-world research corpora — academic papers, case studies, technical reports — and design questions demanding comprehensive analysis.
  • Write structured ground-truth oracles (JSON) with specific, verifiable answers that confirm the agent genuinely processed source material.
  • Design LLM judge prompts that evaluate agent output field-by-field against the oracle.
  • Create decomposition guides that distribute research across multiple parallel sub-agents (per document, per domain, then synthesis).
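
As a rough illustration of the oracle and field-by-field evaluation described above, here is a minimal Python sketch. It is not Turing's actual format or tooling: the field names, values, and the `score_fields` helper are hypothetical, and a deterministic exact-match check stands in for the LLM judge.

```python
import json

# Hypothetical oracle: field names and values are invented for illustration only.
ORACLE_JSON = """
{
  "num_papers_reviewed": 12,
  "primary_method": "randomized controlled trial",
  "pooled_sample_size": 4821
}
"""


def normalize(value):
    """Trim and lowercase strings so trivial formatting differences do not fail a field."""
    if isinstance(value, str):
        return value.strip().lower()
    return value


def score_fields(oracle, agent_output):
    """Return a per-field pass/fail map; a missing field counts as a failure."""
    results = {}
    for field, expected in oracle.items():
        actual = agent_output.get(field)
        results[field] = (
            field in agent_output
            and type(actual) is type(expected)  # exact values: 12 is not "12"
            and normalize(actual) == normalize(expected)
        )
    return results


oracle = json.loads(ORACLE_JSON)

# Hypothetical agent answer: one field differs from the oracle.
agent_output = {
    "num_papers_reviewed": 12,
    "primary_method": " Randomized controlled trial ",
    "pooled_sample_size": 4800,
}

report = score_fields(oracle, agent_output)
accuracy = sum(report.values()) / len(report)
print(report, accuracy)
```

Scoring each field separately, rather than comparing the whole answer at once, makes it visible which specific facts the agent extracted correctly and which it approximated or missed.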

Required Qualifications

  • 5+ years of research experience in any scientific domain (academic or industry).
  • Strong reading comprehension and ability to extract structured information from unstructured text.
  • Experience with JSON/data structures — designing schemas and validating output formats.
  • Python scripting ability for judge scripts and data processing.
  • Experience with AI coding benchmarks (e.g., SWE-bench, Terminal-Bench).
  • Comfortable with Docker — writing Dockerfiles, building images, and debugging container issues.
  • High attention to detail — oracle construction requires exact values, not approximations.

Strong Pluses

  • Experience with systematic reviews, meta-analyses, or large-scale literature surveys.
  • Familiarity with medical, legal, or scientific document analysis.
  • Experience with NLP or information extraction tasks.
  • Knowledge of LLM evaluation and benchmarking (MMLU, GPQA, SimpleQA).
  • Experience curating datasets for AI evaluation.

Contract Details

  • Commitment: 40 hours/week, with 4 hours of overlap with PST working hours required.
  • Engagement Type: Contractor/Freelancer (no medical or paid leave benefits).
  • Duration: 1 month, with expected start next week.

Perks

  • Fully remote work environment.
  • Opportunity to contribute to cutting-edge AI research projects with leading LLM companies.
  • Potential for contract extension based on performance and project needs.
