Job Description
- Develop and improve evaluation methodologies for assessing model output quality, including the metrics and coverage of both machine and human evaluation.
- Design and implement scalable data pipelines to extract, transform, and structure product logs for evaluation use cases.
- Synthesize datasets for human or machine evaluation.
- Analyze and interpret results from A/B tests, offline benchmarks, and live experiments to drive actionable recommendations.
- Train ML classifiers to analyze and label user logs (e.g., classify intent, detect quality issues) for evaluation.
- Draw insights from eval results to form recommendations, and drive eval experiments to find the best solutions.
- Work closely with product managers, engineers, and researchers to define evaluation criteria aligned with product goals and user value.
- Create and maintain dashboards and reporting tools to monitor eval performance and trends.
- Contribute to the development of custom metrics that go beyond standard benchmarks to capture product-specific nuances.
- Stay current with the latest LLM research on evaluation and prompting.
- Embody our Culture and Values.
- Doctorate in Data Science, Mathematics, Statistics, Econometrics, Economics, Operations Research, Computer Science, or related field AND 1 year of data-science experience (e.g., managing structured and unstructured data, applying statistical techniques and reporting results)
- OR Master's Degree in Data Science, Mathematics, Statistics, Econometrics, Economics, Operations Research, Computer Science, or related field AND 3 years of data-science experience (e.g., managing structured and unstructured data, applying statistical techniques and reporting results)
- OR Bachelor's Degree in Data Science, Mathematics, Statistics, Econometrics, Economics, Operations Research, Computer Science, or related field AND 5 years of data-science experience (e.g., managing structured and unstructured data, applying statistical techniques and reporting results)
- OR equivalent experience.
- 5 years of experience in data science, ML evaluation, or applied research.
- Working knowledge of LLM evaluation methods, including experience conducting both human evaluations and LLM-as-a-judge assessments.
- Experience using Python, SQL, and common data analysis libraries for data processing and analysis.
- Ability to analyze complex problems, communicate findings clearly, and translate insights into actionable steps.
- Experience building or evaluating LLM applications in production.
- Product-driven thinking.
- Ability to work in a fast-paced environment, manage multiple priorities, and adapt to changing requirements and deadlines.