Common Tasks
- Scoring Prompts: Use golden datasets to compare prompt outputs against ground truths, and incorporate human or AI evaluators for quality assessment.
- One-off Bulk Jobs: Ideal for prompt experimentation and iteration.
- Backtesting: Use historical data to build datasets and compare how a new prompt version performs against real production examples.
- Regression Testing: Build evaluation pipelines and datasets to prevent edge-case regression on prompt template updates.
- Continuous Integration: Connect evaluation pipelines to prompt templates to automatically run an eval with each new version (and catalogue the results). Think of it like a GitHub Action (see the sketch after this list).
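The sketch below ties a few of these tasks together: it scores a prompt version against a small golden dataset and gates on an aggregate pass rate, the way a CI check would on each new version. The `generate` stub, the dataset, and the `intent-classifier-v2` name are hypothetical placeholders; wire them to your own prompt template and examples.

```python
from typing import Callable

# Tiny golden dataset: inputs paired with expected (ground-truth) outputs.
GOLDEN_DATASET = [
    {"input": "Reset my password", "expected": "account_support"},
    {"input": "Where is my order?", "expected": "order_status"},
]

def generate(prompt_version: str, user_input: str) -> str:
    """Placeholder for your prompt template / model call
    (stubbed here so the sketch runs end to end)."""
    return "account_support"

def exact_match(output: str, expected: str) -> bool:
    # Simplest possible grader; swap in an AI or human evaluator for
    # semantic criteria that string comparison cannot capture.
    return output.strip() == expected

def run_eval(prompt_version: str,
             grade: Callable[[str, str], bool],
             threshold: float = 0.9) -> bool:
    """Score every golden example and gate on an aggregate pass rate."""
    passed = sum(
        grade(generate(prompt_version, case["input"]), case["expected"])
        for case in GOLDEN_DATASET
    )
    score = passed / len(GOLDEN_DATASET)
    print(f"{prompt_version}: {passed}/{len(GOLDEN_DATASET)} passed ({score:.0%})")
    return score >= threshold

if __name__ == "__main__":
    if not run_eval("intent-classifier-v2", exact_match):
        raise SystemExit(1)  # non-zero exit fails the run, like a failing GitHub Action
```

The same loop works for backtesting (point the dataset at historical production examples and run two prompt versions side by side) and for regression testing (keep known edge cases in the dataset so a template update that breaks them fails the gate).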
Example Use Cases
- Chatbot Enhancements: Improve chatbot interactions by evaluating responses to user requests against semantic criteria.
- RAG System Testing: Build a RAG pipeline and validate responses against a golden dataset.
- SQL Bot Optimization: Test natural-language-to-SQL prompts by actually running the generated queries against a database (using the API Endpoint step), then evaluating the accuracy of the results (see the first sketch after this list).
- Improving Summaries: Combine AI evaluation prompts and human graders to improve prompts that lack a ground truth (see the second sketch after this list).
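For SQL generation, grading the executed results is usually more robust than comparing SQL text, since differently worded queries can be equivalent. A rough sketch, assuming a throwaway in-memory SQLite database and a stubbed `nl_to_sql` function standing in for your actual generation prompt:

```python
import sqlite3

def nl_to_sql(question: str) -> str:
    """Stub: replace with the output of your NL-to-SQL prompt."""
    return "SELECT name FROM customers WHERE country = 'DE' ORDER BY name"

def run_query(conn: sqlite3.Connection, sql: str) -> list[tuple]:
    return conn.execute(sql).fetchall()

# Seed a small test database the generated query will run against.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, country TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [("Ada", "DE"), ("Bo", "US"), ("Cleo", "DE")])

expected = [("Ada",), ("Cleo",)]
actual = run_query(conn, nl_to_sql("Which customers are in Germany?"))

# Grade on the returned rows, not the SQL string.
print("PASS" if actual == expected else f"FAIL: {actual!r}")
```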
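When there is no ground truth, an AI evaluator can score each output against a rubric and route low-scoring cases to human graders. A minimal sketch; `ask_judge` is a hypothetical hook for whatever grader model you use, stubbed here so the example runs:

```python
RUBRIC = (
    "Score the summary from 1-5 for faithfulness to the source text "
    "and for conciseness. Reply with a single integer."
)

def ask_judge(source: str, summary: str) -> int:
    """Stub: send RUBRIC, the source, and the summary to your evaluator model."""
    return 4

def grade(examples: list[dict], human_review_below: int = 3) -> None:
    needs_human = []
    for ex in examples:
        ex["score"] = ask_judge(ex["source"], ex["summary"])
        if ex["score"] < human_review_below:
            needs_human.append(ex)  # route borderline cases to human graders
    avg = sum(ex["score"] for ex in examples) / len(examples)
    print(f"avg judge score: {avg:.2f}, flagged for human review: {len(needs_human)}")

grade([{"source": "Long article text...", "summary": "Short recap..."}])
```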