This module requires an existing prompt in your PromptLayer account. Please follow the Getting Started guide to create one if needed.

Prompt engineering is the art of building AI systems that respond correctly to all user inputs. Evaluations, unit tests, and historical backtests are the key to writing reliable prompts.

Create Test Datasets

To run a prompt evaluation, you must first create a dataset filled with a diverse set of sample test cases. Each row in a dataset contains test input values; an evaluation runs a prompt over this batch of sample inputs.

For example, for a prompt that generates haiku poetry, a dataset could include columns of sample topics and languages for the poetry.

haiku_dataset.csv
Topic,Language
American History,English
Forbidden Love,Spanish
Rock n Roll Music,English
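
If you would rather generate the CSV programmatically before uploading it, a minimal Python sketch using only the standard library (the file name and rows simply mirror the example above) could look like this:

create_haiku_dataset.py
import csv

# Each row is one test case: a topic and a target language for the haiku prompt.
rows = [
    {"Topic": "American History", "Language": "English"},
    {"Topic": "Forbidden Love", "Language": "Spanish"},
    {"Topic": "Rock n Roll Music", "Language": "English"},
]

with open("haiku_dataset.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["Topic", "Language"])
    writer.writeheader()    # writes the Topic,Language header row
    writer.writerows(rows)  # one line per test case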

In PromptLayer, you can create a dataset from scratch, upload a CSV, or build one from your request history.

Learn more about datasets here.

Instructions

  1. Navigate to the Datasets page in the left sidebar.

  2. Click the Create Dataset button.

    • If you already have a CSV file, upload it (or download an example dataset here).
    • To create one from scratch, use the Add Column and Add Row buttons.
    • To build a dataset from historical logs, click Add from Request History (a programmatic sketch of the same idea follows these steps).
  3. Don’t forget to save the dataset!
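
If your logs live outside PromptLayer, you can assemble the same kind of dataset from them before uploading. The sketch below is purely illustrative: requests.json and its topic and language fields are hypothetical placeholders for whatever structure your own exported logs use.

build_dataset_from_logs.py
import csv
import json

# Hypothetical export: a JSON list of past requests, each recording the
# input variables that were sent to the prompt. Adjust the field names
# to match your own logs.
with open("requests.json", encoding="utf-8") as f:
    past_requests = json.load(f)

with open("haiku_dataset.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["Topic", "Language"])
    writer.writeheader()
    for request in past_requests:
        writer.writerow({
            "Topic": request.get("topic", ""),
            "Language": request.get("language", "English"),
        })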


Configure Evaluation Metrics and Scoring Criteria

There are many ways to evaluate a prompt. Running evaluations in PromptLayer is flexible, allowing for backtests, LLM-as-judge evals, deterministic heuristics, or any combination of the three.

To start, let’s think about how a human might evaluate prompt results and build an evaluation around those heuristics. In our example of an AI poetry writer, this could include syllable count, rhyme scheme, and general coherence.

It’s always best to start with the simplest eval possible, so let’s just add one LLM-as-judge check to verify that the result is an actual haiku.
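
In PromptLayer this check is configured in the UI (see the instructions below), but conceptually an LLM-as-judge assertion is just a second model call that answers a yes/no question about the first model’s output. Here is a minimal sketch of that idea using the OpenAI Python SDK; the judge model and prompt wording are placeholder assumptions, not PromptLayer’s internal implementation.

llm_judge_sketch.py
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def is_valid_haiku(poem: str) -> bool:
    """Ask a judge model whether the poem is strictly a haiku."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge model
        messages=[
            {"role": "system", "content": "Answer with exactly YES or NO."},
            {
                "role": "user",
                "content": "Is this poem strictly a haiku, following the "
                           "standard syllable count?\n\n" + poem,
            },
        ],
    )
    return response.choices[0].message.content.strip().upper().startswith("YES")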

Instructions

  1. Click on Evaluate in the navigation bar to get to the evaluations page.

  2. Click New Batch Run to create a new eval test set.

  3. Select the dataset we created earlier. We will name this evaluation “Haiku Evaluation”.

  4. An evaluation is simply a batch run of a prompt with subsequent evaluation steps. Start by adding the prompt as a new step.

    • Click Add Step
    • Select Prompt Template as the step type
    • Select the “ai-poet” prompt
    • Name the step “ai-haiku”
    • Match the input variable topic with one of the columns from your dataset. As the evaluation runs, the value from each row will be injected into this variable (a programmatic sketch of this mapping follows these steps).
  5. Next, add an LLM-as-judge prompt to test if the output is a haiku.

    • Click Add Step
    • Select LLM Assertion as the step type
    • For the assertion, we will use “This poem is strictly a haiku, following the standard syllable count”
    • Name the step “valid-haiku”
    • Choose the column from our previous step, “ai-haiku”, as the input to the assertion.
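
To sanity-check the variable mapping outside the evaluation UI, you can run the same prompt template from code. The sketch below assumes the PromptLayer Python SDK’s run helper and an “ai-poet” template whose only input variable is topic; treat the exact method signature and return shape as assumptions and confirm them against the SDK reference.

run_prompt_sketch.py
import os

from promptlayer import PromptLayer

# Assumes PROMPTLAYER_API_KEY is set in the environment and that the
# "ai-poet" template already exists in your workspace.
promptlayer_client = PromptLayer(api_key=os.environ["PROMPTLAYER_API_KEY"])

row = {"Topic": "American History", "Language": "English"}  # one dataset row

# Same mapping as the eval step: the Topic column feeds the topic variable.
response = promptlayer_client.run(
    prompt_name="ai-poet",
    input_variables={"topic": row["Topic"]},
)
print(response)  # inspect the returned structure for the model's haiku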

There are many step types you can use to build evals in PromptLayer. The most commonly used are:

  • Equality Comparison: Used to compare the prompt output against a ground truth.
  • Prompt Template: Used to chain prompts together or compare different models.
  • Cosine Similarity: Used to quantitatively compare how similar two text objects are.
  • Code Snippet: Python or JavaScript code that can operate on LLM outputs (a rough sketch follows this list).
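
As a concrete example of the Code Snippet step type, a small deterministic check could complement the LLM assertion above. The sketch below is a rough heuristic: the vowel-group syllable counter is approximate, and how a Code Snippet step receives its input may differ from this standalone function, so treat it as an illustration rather than PromptLayer’s API.

haiku_check_sketch.py
import re

def approx_syllables(line: str) -> int:
    """Very rough syllable estimate: count groups of consecutive vowels."""
    return len(re.findall(r"[aeiouy]+", line.lower())) or 1

def looks_like_haiku(poem: str) -> bool:
    """Check for three non-empty lines with roughly a 5-7-5 syllable pattern."""
    lines = [line for line in poem.strip().splitlines() if line.strip()]
    if len(lines) != 3:
        return False
    counts = [approx_syllables(line) for line in lines]
    # Allow a little slack because the syllable heuristic is approximate.
    return all(abs(count - target) <= 1 for count, target in zip(counts, (5, 7, 5)))

# Example: a classic haiku should pass the rough check.
print(looks_like_haiku("An old silent pond\nA frog jumps into the pond\nSplash! Silence again"))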

See more eval examples here.


Run the Evaluation

While creating the evaluation, you will see the output results for the first four rows only. To run a full evaluation on the entire dataset, click Run Full Batch.

To customize the final score calculation of your evaluation, edit and tweak the scorecard.


Additional Resources:

  • Learn more about evals here.
  • Read the guide to creating evals to compare LLM models against each other.