[Remote] AI QA Trainer – LLM Evaluation
Note: This is a remote position open to candidates in the USA. Agency is seeking an AI QA Trainer to help shape the future of AI through rigorous evaluation of large-scale language models. The role involves evaluating model reasoning, reliability, and safety while designing test plans and documenting failure modes to improve AI quality.
Responsibilities
- Converse with the model on real-world scenarios and evaluation prompts
- Verify factual accuracy and logical soundness
- Design and run test plans and regression suites (see the sketch after this list)
- Build clear rubrics and pass/fail criteria
- Capture reproducible error traces with root-cause hypotheses
- Suggest improvements to prompt engineering, guardrails, and evaluation metrics
- Partner on adversarial red-teaming, automation (Python/SQL), and dashboarding to track quality deltas over time
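As a rough illustration of the regression-suite and pass/fail-rubric work described above, here is a minimal PyTest-style sketch. The `get_model_response` stub, the `REGRESSION_CASES` data, and the regex-based rubric criteria are hypothetical placeholders for this posting, not a prescribed tool or workflow.

```python
# Minimal sketch of a pass/fail regression check for an LLM response.
# get_model_response, the example prompt, and the rubric patterns are
# hypothetical placeholders; swap in a real model client and criteria.
import re
import pytest


def get_model_response(prompt: str) -> str:
    """Stand-in for a real model call; replace with your model client."""
    return "The capital of France is Paris."


# Each case pairs a prompt with rubric criteria: facts that must appear
# and phrases that indicate hallucination or an out-of-scope answer.
REGRESSION_CASES = [
    {
        "id": "capital-fact",
        "prompt": "What is the capital of France? Answer in one sentence.",
        "must_contain": [r"\bParis\b"],
        "must_not_contain": [r"\bLyon\b", r"\bMarseille\b"],
    },
]


@pytest.mark.parametrize("case", REGRESSION_CASES, ids=lambda c: c["id"])
def test_response_meets_rubric(case):
    response = get_model_response(case["prompt"])
    # Pass criterion 1: every required fact is present.
    for pattern in case["must_contain"]:
        assert re.search(pattern, response), f"missing required fact: {pattern}"
    # Pass criterion 2: no forbidden content appears.
    for pattern in case["must_not_contain"]:
        assert not re.search(pattern, response), f"forbidden content: {pattern}"
```

Run with `pytest` as usual; in practice each case would call the deployed model, and the rubric would cover factual accuracy, grounding, and safety criteria rather than a single regex check.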
Skills
- Experience shipping QA for ML/AI systems
- Safety/red-team experience
- Test automation frameworks (e.g., PyTest)
- Hands-on work with LLM eval tooling (e.g., OpenAI Evals, RAG evaluators, W&B)
- Evaluation rubric design
- Adversarial testing/red-teaming
- Regression testing at scale
- Bias/fairness auditing
- Grounding verification
- Prompt and system-prompt engineering
- Test automation (Python/SQL)
- High-signal bug reporting
- Clear, metacognitive communication
Education Requirements
- Bachelor's, master's, or PhD in computer science, data science, computational linguistics, statistics, or a related field
Benefits
- Health insurance
- PTO