> ## Documentation Index
> Fetch the complete documentation index at: https://docs.promptlayer.com/llms.txt
> Use this file to discover all available pages before exploring further.

# Evaluation

> Measure the quality and effectiveness of your prompts through systematic testing.

<Warning>
  Legacy Evaluations, Reports, and Datasets are deprecated for new workflows. Use [Tables](/features/tables/overview) for new evaluation, dataset, report, backtesting, and batch workflows. See [Migrate from Evaluations and Datasets](/features/tables/migrate-from-evaluations-and-datasets).
</Warning>

<Tip>
  This module requires an existing prompt in your PromptLayer account. Please
  follow the [Getting Started](/onboarding-guides/getting-started) guide to
  create one if needed.
</Tip>

Prompt engineering is the art of building AI systems that respond correctly to all user inputs. Evaluations, unit tests, and historical backtests are the key to writing reliable prompts.

## Create Test Datasets

To run a prompt evaluation, you must first create a dataset that is filled with a diverse set of sample test cases. Each row in a dataset contains test input values, and an evaluation is when a prompt is run over the batch of sample inputs.

For example, using a prompt that generates haiku poetry, a dataset could include a single column of sample topics and languages for the poetry.

```csv haiku_dataset.csv theme={null}
Topic,Language
American History,English
Forbidden Love,Spanish
Rock n Roll Music,English
```

In PromptLayer, you create a dataset from scratch, upload a CSV, or build a dataset from your request history.

[Learn more about datasets here.](/features/evaluations/datasets-overview)

### Instructions

1. Navigate to the **Datasets** page in the left sidebar.
2. Click the **Create Dataset** button.
   * If you already have a CSV file, upload it or optionally download an example dataset <a href="./example-dataset/haiku.csv" download>here</a>.
   * To create one from scratch, use the **Add Column** and **Add Row** buttons.
   * To build a dataset from historical logs, click **Add from Request History**.
3. Don't forget to save the dataset!

   <video controls>
     <source src="https://mintcdn.com/promptlayer/v0RzaTvbzopITX7U/onboarding-guides/videos/create-dataset.mp4?fit=max&auto=format&n=v0RzaTvbzopITX7U&q=85&s=8575c027ca4599f16478a5a3e90008dc" type="video/mp4" data-path="onboarding-guides/videos/create-dataset.mp4" />
   </video>

***

## Configure Evaluation Metrics and Scoring Criteria

There are many ways to evaluate a prompt. Running [evaluatins in PromptLayer](/features/evaluations/overview) is flexible, allowing for backtests, LLM-as-judge evals, deterministic heuristics, or all three.

To start, let's think about how a human might evaluate prompt results and build an evaluatin around these heuristics. In our example of an AI poetry writer, this could include syllable count, rhyme scheme, and general coherence.

It's always best to start with the simplest eval possible, so let's just add one LLM-as-judge check to verify that the result is an actual haiku.

### Instructions

1. Click on **Evaluate** in the navigation bar to get to the evaluations page.

2. Click **New Batch Run** to create a new eval test set.

3. Select the dataset we created earlier. We will call this "Haiku Evaluation".

4. An evaluation is simply a batch run of a prompt with subsequent evaluation steps. Start by adding the prompt as a new step.

   * Click **Add Step**
   * Select **Prompt Template** as the step type
   * Select the "ai-poet" prompt
   * Name the step "ai-haiku"
   * Match the input variable `topic` with one of the columns from your dataset. As the evaluation runs each row, this variable will be injected.

5. Next, add an LLM-as-judge prompt to test if the output is a haiku.

   * Click **Add Step**
   * Select **LLM Assertion** as the step type
   * For the assertion, we will use `This poem is strictly a haiku, following the standard syllable count`
   * Name the step "valid-haiku"
   * Choose our previous column "ai-haiku" as the input to the assertion.

<video controls className="w-full aspect-video" src="https://mintcdn.com/promptlayer/v0RzaTvbzopITX7U/onboarding-guides/videos/run-eval.mp4?fit=max&auto=format&n=v0RzaTvbzopITX7U&q=85&s=4e0832b688f0f7fc0c0f0ebc44758edc" data-path="onboarding-guides/videos/run-eval.mp4" />

<Info>
  There are many [step types](/features/evaluations/eval-types) you can use to build evals in PromptLayer. The most commonly used are:

  * Equality Comparison: Used to compare the prompt output against a ground truth.
  * Prompt Template: Used to chain prompts together or compare different models.
  * Cosine Similarity: Used to quantitatively compare how *similar* two text objects are.
  * Code Snippet: Python or Javascript code that can operate on LLM outputs.

  [See more eval examples here.](/features/evaluations/examples)
</Info>

***

## Run the Evaluation

While creating the evaluation, you will see the output results for the first four rows only. To run a full evaluation on the entire dataset, click **Run Full Batch**.

<Tip>
  To customize the final score calculation of your evaluation, edit and tweak the [scorecard](/features/evaluations/score-card).
</Tip>

***

**Additional Resources:**

* Learn more about evals [here](/features/evaluations/overview)
* Read [the guide](https://blog.promptlayer.com/migrating-prompts-to-open-source-models/) to creating evals to compare LLM models against eachother.
