
STEM Scientific Software & Evaluation Design | $45–$100/hr | Worldwide Remote
Join a cutting-edge project building large-scale evaluation benchmarks for advanced AI reasoning across scientific and engineering domains. As a Task Designer, you'll create graduate-level computational problems that challenge AI systems to use real scientific software tools — from querying simulations and interpreting outputs to designing experimental strategies and recovering hidden information from data.
This is not a typical annotation or labeling role. You'll be crafting original, research-grade problems, calibrating them against frontier AI models, and iterating until the difficulty hits the right target.
Strong candidates think like puzzle designers: they build problems where difficulty stems from reasoning strategy rather than brute-force computation, and where surface-level pattern matching won't suffice.