> ## Documentation Index > Fetch the complete documentation index at: https://docs.promptlayer.com/llms.txt > Use this file to discover all available pages before exploring further. # Evaluation > Measure the quality and effectiveness of your prompts through systematic testing. Legacy Evaluations, Reports, and Datasets are deprecated for new workflows. Use [Tables](/features/tables/overview) for new evaluation, dataset, report, backtesting, and batch workflows. See [Migrate from Evaluations and Datasets](/features/tables/migrate-from-evaluations-and-datasets). This module requires an existing prompt in your PromptLayer account. Please follow the [Getting Started](/onboarding-guides/getting-started) guide to create one if needed. Prompt engineering is the art of building AI systems that respond correctly to all user inputs. Evaluations, unit tests, and historical backtests are the key to writing reliable prompts. ## Create Test Datasets To run a prompt evaluation, you must first create a dataset that is filled with a diverse set of sample test cases. Each row in a dataset contains test input values, and an evaluation is when a prompt is run over the batch of sample inputs. For example, using a prompt that generates haiku poetry, a dataset could include a single column of sample topics and languages for the poetry. ```csv haiku_dataset.csv theme={null} Topic,Language American History,English Forbidden Love,Spanish Rock n Roll Music,English ``` In PromptLayer, you create a dataset from scratch, upload a CSV, or build a dataset from your request history. [Learn more about datasets here.](/features/evaluations/datasets-overview) ### Instructions 1. Navigate to the **Datasets** page in the left sidebar. 2. Click the **Create Dataset** button. * If you already have a CSV file, upload it or optionally download an example dataset here. * To create one from scratch, use the **Add Column** and **Add Row** buttons. * To build a dataset from historical logs, click **Add from Request History**. 3. Don't forget to save the dataset!

*** ## Configure Evaluation Metrics and Scoring Criteria There are many ways to evaluate a prompt. Running [evaluatins in PromptLayer](/features/evaluations/overview) is flexible, allowing for backtests, LLM-as-judge evals, deterministic heuristics, or all three. To start, let's think about how a human might evaluate prompt results and build an evaluatin around these heuristics. In our example of an AI poetry writer, this could include syllable count, rhyme scheme, and general coherence. It's always best to start with the simplest eval possible, so let's just add one LLM-as-judge check to verify that the result is an actual haiku. ### Instructions 1. Click on **Evaluate** in the navigation bar to get to the evaluations page. 2. Click **New Batch Run** to create a new eval test set. 3. Select the dataset we created earlier. We will call this "Haiku Evaluation". 4. An evaluation is simply a batch run of a prompt with subsequent evaluation steps. Start by adding the prompt as a new step. * Click **Add Step** * Select **Prompt Template** as the step type * Select the "ai-poet" prompt * Name the step "ai-haiku" * Match the input variable `topic` with one of the columns from your dataset. As the evaluation runs each row, this variable will be injected. 5. Next, add an LLM-as-judge prompt to test if the output is a haiku. * Click **Add Step** * Select **LLM Assertion** as the step type * For the assertion, we will use `This poem is strictly a haiku, following the standard syllable count` * Name the step "valid-haiku" * Choose our previous column "ai-haiku" as the input to the assertion.