Continuous Integration (CI) of prompt evaluations is the holy grail of prompt engineering. 🏆

CI in the context of prompt engineering involves the automated testing and validation of prompts every time a new version is created or updated. LLMs are a probabilistic technology. It is hard (read: virtually impossible) to ensure a new prompt version doesn’t break old user behavior just by eyeballing the prompt. Rigorous testing is the best tool we have.

We believe that it’s important to both allow subject-matter experts to write new prompts and provide them with tools to easily test if the prompts broke anything. That’s where PromptLayer evaluations comes in.

Test-driven Prompt Engineering

Similar to test-driven development (TDD) in software engineering, test-driven prompt engineering involves writing and running evaluations against new prompt versions before they are used in production. This proactive testing ensures that new prompts meet predefined criteria and behave as expected, minimizing the risk of unintended consequences.

Setting up automatic evaluations on a specific prompt template is easy. When creating a new version, after adding a commit message, you will be prompted to select an evaluation pipeline to run. After doing this once, every new prompt template you create will run this pipeline by default.

NOTE: Make sure your evaluation pipeline uses the “latest” version of the prompt template in its column step. The template is fetched at runtime. If you specify a frozen version, the evaluation report won’t reflect your newest prompt template.

Testing Strategies

Backtesting

Backtesting involves running new prompt versions against a dataset compiled from historical production data. This strategy provides a real-world context for evaluating prompts, allowing you to assess how new versions would have performed under past conditions. It’s an effective way to detect potential regressions and validate improvements, ensuring that updates enhance rather than detract from the user experience.

To set up backtests, follow the steps below:

1. Create a historical dataset

Create a dataset using a search query. For example, I might want to create a dataset using all logged requests:

  • That use my_prompt_template version 6 or version 5
  • That were made in the last 2 months
  • That were using the tag prod
  • That users gave a đź‘Ť response to

This dataset will help you understand if your new prompt version broke any previous versions!

2. Build an evaluation pipeline

The next step is to create an evaluation pipeline using our new historical dataset.

In plain English, this evaluation will feed in historical request context into your new prompt version then compare the new results to the old results. You can do a simple string comparison or get fancy with cosine similarities. PromptLayer will even show you a diff view for responses that are different.

3. Run it when you make a new version

This is the fun part. Next time you make a new prompt version, just select our new backtesting pipeline to see how the new prompt version fairs.

Regression Testing

Regression testing is the continuous refinement of evaluation datasets to include new edge cases and scenarios as they are discovered. This iterative process ensures that prompts remain robust against a growing set of challenges, preventing regressions in areas previously identified as potential failure points. By continually updating evaluations with new edge cases, you maintain a high standard of prompt quality and reliability.

The process of setting up regression tests looks similar to backtesting.

Create a dataset containing test cases for every edge case you can think of. The dataset should include context variables that you can input to your prompt template.

Scoring

The evaluation can result in a single quantitative final score. To configure the score card, all you need to do is make sure that the last step consists entirely of numbers or Booleans. A final objective score makes comparing prompt performance easy, and it will be displayed alongside prompts in the Prompt Registry.