This page provides an overview of the various evaluation column types available on our platform.

Primary Types

Prompt Template

The Prompt Template evaluation type lets you execute a prompt template from the Prompt Registry. You can select the latest version, a specific label, or a particular version of the template; map its input variables to inputs from the dataset or other columns; and override the model parameters set in the Prompt Registry. This is particularly useful for testing a prompt template within a larger evaluation pipeline, comparing different model parameters, or implementing an “LLM as a judge” prompt template.

Custom API Endpoint

The Custom API Endpoint evaluation type lets you register a webhook that our system calls with a POST request when the cell is executed. Because cells are processed sequentially, the payload contains the values of every column to the left of the endpoint column, and the value your endpoint returns is displayed in the cell. This allows extensive customization for specific use cases and integration with external systems or custom evaluators.

The payload will be in the following form:

    {
        "data": {
            "column1": "value1",
            "column2": "value2",
            "column3": "value3",
            ...
        }
    }
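
For illustration, a minimal endpoint might look like the following Python sketch using Flask. The route name, the column names, and the response shape (a single JSON value that is displayed in the cell) are illustrative assumptions, not a documented contract.

    from flask import Flask, jsonify, request

    app = Flask(__name__)

    # Hypothetical route; point the Custom API Endpoint column at this URL.
    @app.route("/evaluate", methods=["POST"])
    def evaluate():
        columns = request.get_json()["data"]  # all columns to the left
        # Example custom evaluator: pass if the response mentions the product.
        verdict = "acme" in columns.get("response", "").lower()
        # Assumed response shape: the returned value is shown in the cell.
        return jsonify(verdict)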

Human Input

The Human Input evaluation type adds a numeric or text input column where an evaluator can provide feedback via a slider or a text box. This input can then be used by subsequent columns in the evaluation pipeline, allowing for the incorporation of human judgment.

Code Execution

The Code Execution evaluation type allows you to write and execute code for each row in your dataset. You can access the row's data through the data variable and return the cell value. Note that stdout is ignored.

For example, the following Python sketch returns a list of the names of each column (assuming data is a dictionary keyed by column name):
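
    # data maps column names to this row's values (per the description above).
    # Assumption: the value of the final expression becomes the cell value;
    # stdout is ignored.
    list(data.keys())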

Python Runtime

The Python runtime runs Python 3.12.0 with no filesystem or network access. Only the standard library is available. Here are the resource quotas:

- Input code size: 1MiB
- Size of stdin: 10MiB
- Size of stdout: 20MiB
- Size of stderr: 10MiB
- Number of environment variables: 100
- Environment variable key size: 4KiB
- Environment variable value size: 100KiB
- Number of arguments: 100
- Argument size: 100KiB
- Execution duration: 30 seconds
- Memory consumption: 128MiB
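
Since only the standard library is available, evaluators must rely on built-in modules such as json and re. A minimal sketch (the column name and field are illustrative):

    import json
    import re

    # Parse a JSON column and count the sentences in one of its fields.
    payload = json.loads(data["model_output"])
    sentences = re.split(r"[.!?]+\s+", payload.get("answer", ""))
    len([s for s in sentences if s.strip()])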

JavaScript Runtime

The JavaScript runtime is built on Mozilla's SpiderMonkey engine with no filesystem or network access. It is not Node.js or Deno. Available APIs include:

- Legacy Encoding: atob, btoa, decodeURI, encodeURI, decodeURIComponent, encodeURIComponent
- Streams: ReadableStream, ReadableStreamBYOBReader, ReadableStreamBYOBRequest, ReadableStreamDefaultReader, ReadableStreamDefaultController, ReadableByteStreamController, WritableStream, ByteLengthQueuingStrategy, CountQueuingStrategy, TransformStream
- URL: URL, URLSearchParams
- Console: console
- Performance: Performance
- Task: queueMicrotask, setInterval, setTimeout, clearInterval, clearTimeout
- Location: WorkerLocation, location
- JSON: JSON
- Encoding: TextEncoder, TextDecoder, CompressionStream, DecompressionStream
- Structured Clone: structuredClone
- Fetch: fetch, Request, Response, Headers
- Crypto: SubtleCrypto, Crypto, crypto, CryptoKey

Resource Quotas:

- Input code size: 1MiB
- Size of stdin: 10MiB
- Size of stdout: 20MiB
- Size of stderr: 10MiB
- Number of environment variables: 100
- Environment variable key size: 4KiB
- Environment variable value size: 100KiB
- Number of arguments: 100
- Argument size: 100KiB
- Execution duration: 30 seconds
- Memory consumption: 128MiB

Simple Evals

Equality Comparison

Equality Comparison allows you to compare two columns as strings and displays a visual diff when they differ. The diff itself is not used when calculating the score; the column is treated as a boolean for scoring purposes. If there is no difference, the column returns true.
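
In spirit, the comparison works like this Python sketch (column names are illustrative):

    # Both cells are compared as strings; the boolean doubles as the score.
    str(data["expected"]) == str(data["actual"])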

Contains Value

The Contains evaluation type enables you to search for a substring within a column, for instance a specific word or phrase within each cell. It uses the Python in operator to check whether the substring appears in the cell, and the check is case-insensitive.
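
As a rough sketch (the column name and substring are illustrative):

    # Case-insensitive substring check using the in operator.
    "refund" in data["response"].lower()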

Regex Match

The Regex Match evaluation type allows you to define a regular expression pattern to search within the column. This provides powerful pattern matching capabilities for complex text analysis tasks.
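
For example, checking for an ISO-style date might look like this sketch (the pattern and column name are illustrative):

    import re

    # True if the cell contains a YYYY-MM-DD date anywhere.
    bool(re.search(r"\d{4}-\d{2}-\d{2}", data["response"]))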

Absolute Numeric Distance

The Absolute Numeric Distance evaluation type allows you to select two different columns and output the absolute distance between their numeric values in a new column. Both source columns must contain numeric values.
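
Conceptually (column names are illustrative):

    # Absolute distance between two numeric columns.
    abs(float(data["predicted_score"]) - float(data["target_score"]))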

LLM Evals

Run LLM Assertion

The LLM Assertion evaluation type enables you to run an assertion on a column using natural language prompts. You can create prompts such as “Does this contain an API key?”, “Is this sensitive content?”, or “Is this in English?”. Our system uses a backend prompt template that processes your assertion and returns either true or false. Assertions should be framed as questions.

Cosine Similarity

Cosine Similarity allows you to compare two columns by the distance between their embedding vectors. The system takes the two columns you supply, converts them into strings, and embeds them using OpenAI’s embedding models. It then calculates the cosine similarity, resulting in a number between 0 and 1. This metric is useful for understanding how semantically similar two bodies of text are, which can be valuable in assessing topic adherence or content similarity.
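
For reference, the cosine similarity of two embedding vectors is their dot product divided by the product of their norms. A pure-Python sketch:

    import math

    def cosine_similarity(a, b):
        # dot(a, b) / (|a| * |b|)
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        return dot / (norm_a * norm_b)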

Helper Functions

JSON Extraction

The JSON Extraction evaluation type allows you to define a JSON path and extract either the first match or all matches for that path. We will automatically cast the source column into a JSON object. This is particularly useful for parsing structured data within your evaluations.
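
The behavior is conceptually similar to this sketch built on the third-party jsonpath-ng library (shown for illustration only; the path and column name are examples):

    import json

    from jsonpath_ng import parse

    # Cast the source column to JSON, then collect every match for the path.
    obj = json.loads(data["api_response"])
    matches = [m.value for m in parse("items[*].name").find(obj)]
    matches          # all matches; matches[0] would be the first match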

Parse Value

The Parse Value column type enables you to convert another column into one of the following value types: string, number, Boolean, or JSON.
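
Each target type roughly corresponds to a cast like the following sketch (the Boolean rule shown is an assumption; the column name is illustrative):

    import json

    raw = data["source"]
    as_string = str(raw)
    as_number = float(raw)
    as_boolean = str(raw).strip().lower() in ("true", "1")  # assumed rule
    as_json = json.loads(raw)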

Static Value

The Static Value evaluation type allows you to pre-populate a column with a specific value. This is useful for adding constant values or context that you may need to use later in one of the other columns in your evaluation pipeline.

Type Validation

Type Validation returns a boolean indicating whether the given source column fits the specified type. The supported types are JSON, number, and SQL. It returns true if the value is valid for the specified type, and false otherwise. For SQL validation, the system uses the SQLGlot library.
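
For example, SQL validation with SQLGlot behaves roughly like this sketch (the exact dialect settings used in production are not specified here):

    import sqlglot
    from sqlglot.errors import ParseError

    def is_valid_sql(text):
        # True if SQLGlot can parse the statement, False otherwise.
        try:
            sqlglot.parse_one(text)
            return True
        except ParseError:
            return False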

Coalesce

The Coalesce evaluation type takes multiple columns and returns the first non-null value among them, similar to SQL’s COALESCE function.
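
A rough Python equivalent (column names are illustrative):

    # First non-null value across the selected columns, like SQL's COALESCE.
    values = [data["model_a"], data["model_b"], data["fallback"]]
    next((v for v in values if v is not None), None)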

Count

The Count evaluation type allows you to select a source column and count either the characters, words, or paragraphs within it. This will output a numeric value, which can be useful for analyzing the length or complexity of LLM outputs.
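
The three modes behave roughly like this sketch (the paragraph definition shown is an assumption; the column name is illustrative):

    import re

    text = data["response"]
    characters = len(text)
    words = len(text.split())
    # Assumed definition: paragraphs are separated by blank lines.
    paragraphs = len([p for p in re.split(r"\n\s*\n", text) if p.strip()])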

Please reach out to us if you have any other evaluation types you would like to see on the platform. We are always looking to expand our evaluation capabilities to better serve your needs.