This page provides an overview of the various evaluation column types available on our platform.

Primary Types

Prompt Template

The Prompt Template evaluation type lets you execute a prompt template from the Prompt Registry. You can select the latest version, a specific label, or a particular version of the template; map its input variables to inputs from the dataset or other columns; and override the model parameters set in the Prompt Registry. This is particularly useful for testing a prompt template within a larger evaluation pipeline, comparing different model parameters, or implementing an “LLM as a judge” prompt template.

Custom API Endpoint

The Custom API Endpoint evaluation type lets you register a webhook that our system calls with a POST request when the cell is executed. Because cells are processed sequentially, the payload contains the values of every column to the left of the endpoint column, and the value your endpoint returns is displayed in the cell. This allows extensive customization for specific use cases and integration with external systems or custom evaluators.

The payload will be in the following form:

    {
        "data": {
            "column1": "value1",
            "column2": "value2",
            "column3": "value3",
            ...
        }
    }
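
For illustration, a minimal endpoint might look like the following Python sketch using Flask. The route name, the column names, and the response shape (a single JSON value that is displayed in the cell) are illustrative assumptions, not a documented contract.

    from flask import Flask, jsonify, request

    app = Flask(__name__)

    # Hypothetical route; point the Custom API Endpoint column at this URL.
    @app.route("/evaluate", methods=["POST"])
    def evaluate():
        columns = request.get_json()["data"]  # all columns to the left
        # Example custom evaluator: pass if the response mentions the product.
        verdict = "acme" in columns.get("response", "").lower()
        # Assumed response shape: the returned value is shown in the cell.
        return jsonify(verdict)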

Human Input

The Human Input evaluation type adds a numeric or text input column where an evaluator can provide feedback via a slider or a text box. This input can then be used by subsequent columns in the evaluation pipeline, allowing for the incorporation of human judgment.

Code Execution

The Code Execution evaluation type allows you to write and execute code for each row in your dataset. You can access the row's data through the data variable and return the cell value. Note that stdout is ignored.

For example, the following Python sketch returns a list of the names of each column (assuming data is a dictionary keyed by column name):
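
    # data maps column names to this row's values (per the description above).
    # Assumption: the value of the final expression becomes the cell value;
    # stdout is ignored.
    list(data.keys())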

Python Runtime

The Python runtime runs Python 3.12.0 with no filesystem or network access. Only the standard library is available. Here are the resource quotas:

- Input code size: 1MiB
- Size of stdin: 10MiB
- Size of stdout: 20MiB
- Size of stderr: 10MiB
- Number of environment variables: 100
- Environment variable key size: 4KiB
- Environment variable value size: 100KiB
- Number of arguments: 100
- Argument size: 100KiB
- Execution duration: 30 seconds
- Memory consumption: 128MiB
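
Since only the standard library is available, evaluators must rely on built-in modules such as json and re. A minimal sketch (the column name and field are illustrative):

    import json
    import re

    # Parse a JSON column and count the sentences in one of its fields.
    payload = json.loads(data["model_output"])
    sentences = re.split(r"[.!?]+\s+", payload.get("answer", ""))
    len([s for s in sentences if s.strip()])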

JavaScript Runtime

The JavaScript runtime is built on Mozilla's SpiderMonkey engine with no filesystem or network access. It is not Node.js or Deno. Available APIs include:

- Legacy Encoding: atob, btoa, decodeURI, encodeURI, decodeURIComponent, encodeURIComponent
- Streams: ReadableStream, ReadableStreamBYOBReader, ReadableStreamBYOBRequest, ReadableStreamDefaultReader, ReadableStreamDefaultController, ReadableByteStreamController, WritableStream, ByteLengthQueuingStrategy, CountQueuingStrategy, TransformStream
- URL: URL, URLSearchParams
- Console: console
- Performance: Performance
- Task: queueMicrotask, setInterval, setTimeout, clearInterval, clearTimeout
- Location: WorkerLocation, location
- JSON: JSON
- Encoding: TextEncoder, TextDecoder, CompressionStream, DecompressionStream
- Structured Clone: structuredClone
- Fetch: fetch, Request, Response, Headers
- Crypto: SubtleCrypto, Crypto, crypto, CryptoKey

Resource Quotas:

- Input code size: 1MiB
- Size of stdin: 10MiB
- Size of stdout: 20MiB
- Size of stderr: 10MiB
- Number of environment variables: 100
- Environment variable key size: 4KiB
- Environment variable value size: 100KiB
- Number of arguments: 100
- Argument size: 100KiB
- Execution duration: 30 seconds
- Memory consumption: 128MiB

Simple Evals

Equality Comparison

Equality Comparison allows you to compare two columns as strings and displays a visual diff when they differ. The diff itself is not used when calculating the score; the column is treated as a boolean for scoring purposes. If there is no difference, the column returns true.
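
In spirit, the comparison works like this Python sketch (column names are illustrative):

    # Both cells are compared as strings; the boolean doubles as the score.
    str(data["expected"]) == str(data["actual"])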

Contains Value

The Contains evaluation type enables you to search for a substring within a column, for instance a specific word or phrase within each cell. It uses the Python in operator to check whether the substring appears in the cell, and the check is case-insensitive.
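
As a rough sketch (the column name and substring are illustrative):

    # Case-insensitive substring check using the in operator.
    "refund" in data["response"].lower()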

Regex Match

The Regex Match evaluation type allows you to define a regular expression pattern to search within the column. This provides powerful pattern matching capabilities for complex text analysis tasks.
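
For example, checking for an ISO-style date might look like this sketch (the pattern and column name are illustrative):

    import re

    # True if the cell contains a YYYY-MM-DD date anywhere.
    bool(re.search(r"\d{4}-\d{2}-\d{2}", data["response"]))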

Absolute Numeric Distance

The Absolute Numeric Distance evaluation type allows you to select two different columns and output the absolute distance between their numeric values in a new column. Both source columns must contain numeric values.
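
Conceptually (column names are illustrative):

    # Absolute distance between two numeric columns.
    abs(float(data["predicted_score"]) - float(data["target_score"]))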

LLM Evals

Run LLM Assertion

The LLM Assertion evaluation type enables you to run an assertion on a column using natural language prompts. You can create prompts such as “Does this contain an API key?”, “Is this sensitive content?”, or “Is this in English?”. Our system uses a backend prompt template that processes your assertion and returns either true or false. Assertions should be framed as questions.

Cosine Similarity

Cosine Similarity allows you to compare two columns by the distance between their embedding vectors. The system takes the two columns you supply, converts them into strings, and embeds them using OpenAI’s embedding models. It then calculates the cosine similarity, resulting in a number between 0 and 1. This metric is useful for understanding how semantically similar two bodies of text are, which can be valuable in assessing topic adherence or content similarity.
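
For reference, the cosine similarity of two embedding vectors is their dot product divided by the product of their norms. A pure-Python sketch:

    import math

    def cosine_similarity(a, b):
        # dot(a, b) / (|a| * |b|)
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        return dot / (norm_a * norm_b)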

Helper Functions

JSON Extraction

The JSON Extraction evaluation type allows you to define a JSON path and extract either the first match or all matches for that path. We will automatically cast the source column into a JSON object. This is particularly useful for parsing structured data within your evaluations.
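
The behavior is conceptually similar to this sketch built on the third-party jsonpath-ng library (shown for illustration only; the path and column name are examples):

    import json

    from jsonpath_ng import parse

    # Cast the source column to JSON, then collect every match for the path.
    obj = json.loads(data["api_response"])
    matches = [m.value for m in parse("items[*].name").find(obj)]
    matches          # all matches; matches[0] would be the first match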

Parse Value

The Parse Value column type enables you to convert another column into one of the following value types: string, number, Boolean, or JSON.
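
Each target type roughly corresponds to a cast like the following sketch (the Boolean rule shown is an assumption; the column name is illustrative):

    import json

    raw = data["source"]
    as_string = str(raw)
    as_number = float(raw)
    as_boolean = str(raw).strip().lower() in ("true", "1")  # assumed rule
    as_json = json.loads(raw)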

Static Value

The Static Value evaluation type allows you to pre-populate a column with a specific value. This is useful for adding constant values or context that you may need to use later in one of the other columns in your evaluation pipeline.

Type Validation

Type Validation returns a boolean indicating whether the given source column fits the specified type. The supported types are JSON, number, and SQL. It returns true if the value is valid for the specified type, and false otherwise. For SQL validation, the system uses the SQLGlot library.
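
For example, SQL validation with SQLGlot behaves roughly like this sketch (the exact dialect settings used in production are not specified here):

    import sqlglot
    from sqlglot.errors import ParseError

    def is_valid_sql(text):
        # True if SQLGlot can parse the statement, False otherwise.
        try:
            sqlglot.parse_one(text)
            return True
        except ParseError:
            return False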

Coalesce

The Coalesce evaluation type takes multiple columns and returns the first non-null value among them, similar to SQL’s COALESCE function.
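
A rough Python equivalent (column names are illustrative):

    # First non-null value across the selected columns, like SQL's COALESCE.
    values = [data["model_a"], data["model_b"], data["fallback"]]
    next((v for v in values if v is not None), None)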

Count

The Count evaluation type allows you to select a source column and count either the characters, words, or paragraphs within it. This will output a numeric value, which can be useful for analyzing the length or complexity of LLM outputs.
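
The three modes behave roughly like this sketch (the paragraph definition shown is an assumption; the column name is illustrative):

    import re

    text = data["response"]
    characters = len(text)
    words = len(text.split())
    # Assumed definition: paragraphs are separated by blank lines.
    paragraphs = len([p for p in re.split(r"\n\s*\n", text) if p.strip()])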

Please reach out to us if you have any other evaluation types you would like to see on the platform. We are always looking to expand our evaluation capabilities to better serve your needs.