Test-driven Prompt Engineering
Similar to test-driven development (TDD) in software engineering, test-driven prompt engineering involves writing and running evaluations against new prompt versions before they are used in production. This proactive testing ensures that new prompts meet predefined criteria and behave as expected, minimizing the risk of unintended consequences. Setting up automatic evaluations on a specific prompt template is easy. When creating a new version, after adding a commit message, you will be prompted to select an evaluation pipeline to run. After doing this once, every new prompt template you create will run this pipeline by default. NOTE: Make sure your evaluation pipeline uses the “latest” version of the prompt template in its column step. The template is fetched at runtime. If you specify a frozen version, the evaluation report won’t reflect your newest prompt template.
Testing Strategies
Backtesting
Backtesting involves running new prompt versions against a dataset compiled from historical production data. This strategy provides a real-world context for evaluating prompts, allowing you to assess how new versions would have performed under past conditions. It’s an effective way to detect potential regressions and validate improvements, ensuring that updates enhance rather than detract from the user experience. To set up backtests, follow the steps below: 1. Create a historical dataset
- That use
my_prompt_template
version 6 or version 5 - That were made in the last 2 months
- That were using the tag
prod
- That users gave a 👍 response to

Regression Testing
Regression testing is the continuous refinement of evaluation datasets to include new edge cases and scenarios as they are discovered. This iterative process ensures that prompts remain robust against a growing set of challenges, preventing regressions in areas previously identified as potential failure points. By continually updating evaluations with new edge cases, you maintain a high standard of prompt quality and reliability. The process of setting up regression tests looks similar to backtesting. Create a dataset containing test cases for every edge case you can think of. The dataset should include context variables that you can input to your prompt template.Scoring
The evaluation can result in a single quantitative final score. To configure the score card, all you need to do is make sure that the last step consists entirely of numbers or Booleans. A final objective score makes comparing prompt performance easy, and it will be displayed alongside prompts in the Prompt Registry.