SwarmBench Task Engineer — Data Analysis | Contractor | Remote | 4-Week Engagement
Turing is seeking experienced SwarmBench Task Engineers specializing in Data Analysis to design and develop high-quality multi-agent benchmark tasks that evaluate the analytical reasoning, coordination, and execution capabilities of advanced AI systems. This is a short-term, high-impact contractor role working at the frontier of LLM evaluation.
About Turing
Turing is one of the world's fastest-growing AI companies, accelerating the advancement and deployment of powerful AI systems. We partner with leading AI labs to advance frontier model capabilities in reasoning, coding, agentic behavior, and more — and we build real-world AI systems that solve mission-critical challenges for enterprises.
Role Overview
In this role, you will build realistic benchmark tasks requiring AI agents to analyze large, complex, multi-source datasets, decompose work across specialist sub-agents, and arrive at specific, verifiable conclusions. Tasks may involve structured and semi-structured data such as CSVs, JSON files, logs, reports, survey results, vendor assessments, and financial or operational documents.
Day-to-Day Responsibilities
- Design and author multi-agent benchmark tasks centered on complex data analysis workflows
- Create realistic synthetic datasets or curate real-world-style datasets across domains such as finance, operations, security, or market analysis
- Build tasks requiring agents to perform cross-referencing, anomaly detection, contradiction identification, and statistical computation across multiple sources
- Develop decomposition guides that split analytical work across specialist sub-agents (e.g., financial, technical, security, or operations analysts)
- Write precise oracle logic or verification scripts that validate specific analytical conclusions
- Create reproducible evaluation environments using Python and Docker
- Review task performance signals to ensure strong separation between weaker and stronger agentic systems
- Refine tasks to improve determinism, clarity, difficulty, and scoring quality
Requirements
- 5+ years of experience in data analysis
- Strong proficiency in SQL and Python for data analysis and scripting (pandas, NumPy, or similar)
- Experience working with real-world, messy datasets (CSV, JSON, logs, reports)
- Ability to design non-trivial analytical questions with clear, specific, and verifiable answers
- Solid understanding of statistical concepts (averages, distributions, outliers, correlations)
- Familiarity with AI coding benchmark environments (e.g., SWE-bench, Terminal-Bench)
- Comfortable working with Docker (writing Dockerfiles, building images, debugging containers)
Contract Details
- Duration: 4 weeks (expected start: next week)
- Commitment: 8 hours/day, including a 4-hour overlap with PST working hours
- Type: Contractor position (does not include medical or paid-leave benefits)
Why Work With Turing?
- Contribute to cutting-edge AI projects with leading foundation model companies
- Work on high-impact tasks at the frontier of LLM evaluation and reasoning
- Fully remote with flexible collaboration across global teams