The score card feature in PromptLayer allows you to assign a score to each evaluation you run. This score provides a quick and easy way to assess the performance of your prompts and compare different versions.

Configuring the Score Card

Default Configuration

By default, the score is calculated based on the last column in your evaluation results:

  • If the last column contains Booleans, the score will be the percentage of true values.
  • If the last column contains numbers, the score will be the average of those numbers.
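
To make the arithmetic concrete, here is a minimal sketch of the default calculation. PromptLayer performs this computation for you; the function name, the `rows` list, and the column name below are purely illustrative:

Python
# Hypothetical sketch of the default score card calculation.
# `rows` stands in for the evaluation results and `last_column` for the
# name of the final column. PromptLayer does this automatically.
def default_score(rows, last_column):
    values = [row[last_column] for row in rows]
    if all(isinstance(v, bool) for v in values):
        # Boolean column: percentage of true values
        return 100 * sum(values) / len(values)
    # Numeric column: average of the numbers
    return sum(values) / len(values)

# Example: a Boolean last column with 3 of 4 true values scores 75.0
print(default_score(
    [{'passed': True}, {'passed': True}, {'passed': False}, {'passed': True}],
    'passed',
))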

Custom Column Selection

You can customize which columns are included in the score card calculation. When setting up your evaluation pipeline, click the “Score card” button to configure the score card.

Here, you can add specific columns to be included in the score calculation:

  • If you add multiple numeric columns, the total score will be the average of the averages for each selected column.
  • If you add multiple Boolean columns, the total score will be the average of the true percentages for each selected column.
  • Columns that do not contain numbers or Booleans will not be included in the score calculation.
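
In other words, each selected column first gets its own score (a numeric average, or a true-percentage for Booleans), and the total is the plain average of those per-column scores. A minimal sketch, with made-up per-column scores:

Python
# Hypothetical sketch: the total score is the average of the per-column scores.
def total_score(per_column_scores):
    return sum(per_column_scores) / len(per_column_scores)

# e.g. columns scoring 80.0, 90.0 and 70.0 give a total score of 80.0
print(total_score([80.0, 90.0, 70.0]))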

These selected columns are also formatted for easier viewing in the evaluation report: numbers are displayed larger, and Booleans are shown as check/x icons.

Custom Scoring Logic

For more advanced scoring needs, you can provide your own custom scoring logic using Python or JavaScript code. The code execution environment is the same as the one used for the code execution evaluation column type (learn more).

This custom scoring logic can generate a single score number, a drill-down matrix, or several drill-down matrices. Returning multiple matrices is useful, for example, for confusion matrices.

Your custom scoring code must return an object with the following keys:

  • score (required): A number representing the overall score.
  • score_matrix (optional): A list of lists of lists, representing one or more matrices of drilled-down scores. Each cell in these matrices can be a raw value or an object with metadata.
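
The simplest valid return value contains only the required score key, for example (the value here is purely illustrative):

Python
# Minimal return value: the overall score only
return {'score': 0.85}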

Score Matrix Cell Format

Each cell in the score_matrix can be either:

  • A raw value (string or number), or
  • An object with the following properties:
    • value: The actual value of the cell, which can be a string or number.
    • positive_metric: (Optional) A Boolean indicating whether an increase in this value is considered an improvement. Defaults to true if omitted.

Examples

  • Simple value: 42
  • Object with metadata: {"value": 42, "positive_metric": true}

The optional positive_metric property can be used to indicate how changes in the value should be interpreted when comparing evaluations. This is particularly useful for automated reporting and analysis tools.
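For instance, a cell tracking something where lower values are better (say, a hypothetical latency metric) could be written as {"value": 120, "positive_metric": false}, so that an increase between runs is interpreted as a change for the worse rather than an improvement.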

Code example

A data variable is available in your scoring code. It is a list containing one dictionary per row in the evaluation results; the keys in each dictionary are the column names, and the values are the corresponding cell values.

For example:

Python
# The variable `data` is a list of rows.
# Each row is a dictionary of column name -> value
# For example: [
#       {'columnA': 1, 'columnB': 2},
#       {'columnA': 4, 'columnB': 1}
#  ]
# 
# Must return a dictionary with the following structure:
# {
#   'score': number,         # Required - the overall score
#   'score_matrix': [...],   # Optional - a list of matrices (lists of rows);
#                            # each cell may be a string, a number, or an
#                            # object with metadata
# }

return {
    'score': len(data),
    'score_matrix': [[
        ["Criteria", "Weight", "Value"],
        ["Correctness", 4, 7],
        ["Completeness", 3, 6],
        ["Accuracy", 5, 8],
        ["Relevance", 4, 9]
    ]],
}
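
Because score_matrix is a list of matrices, you can also return more than one, for example to build a confusion matrix. The sketch below assumes two hypothetical Boolean columns named expected and predicted; substitute whatever column names your evaluation actually produces.

Python
# Hypothetical sketch: build a confusion matrix from two columns.
# The column names 'expected' and 'predicted' are examples only.
tp = sum(1 for row in data if row['expected'] and row['predicted'])
fp = sum(1 for row in data if not row['expected'] and row['predicted'])
fn = sum(1 for row in data if row['expected'] and not row['predicted'])
tn = sum(1 for row in data if not row['expected'] and not row['predicted'])

accuracy = (tp + tn) / len(data)

return {
    'score': accuracy,
    'score_matrix': [
        [   # first matrix: the confusion matrix
            ["", "Predicted true", "Predicted false"],
            ["Actually true", tp, fn],
            ["Actually false", fp, tn],
        ],
        [   # second matrix: headline counts, with metadata on one cell
            ["Metric", "Value"],
            ["Rows evaluated", len(data)],
            ["False positives", {"value": fp, "positive_metric": False}],
        ],
    ],
}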

Comparing Evaluation Reports

You can compare two evaluation reports to see how scores and other metrics have changed between runs. Simply click the “Compare” button and select the evaluation reports you want to compare.

The score card and any score matrices will be displayed side-by-side for easy comparison of your prompt’s performance over time.