The score card feature in PromptLayer allows you to assign a score to each evaluation you run. This score provides a quick and easy way to assess the performance of your prompts and compare different versions.
By default, the score is calculated from the last column in your evaluation results, as the percentage of `true` values.

You can customize which columns are included in the score card calculation. When setting up your evaluation pipeline, click the “Score card” button to configure it.
Here, you can add specific columns to be included in the score calculation; the score card will then show the `true` percentage for each selected column.

These selected columns are also formatted for easier viewing in the evaluation report: numbers are displayed larger, and booleans are shown with check/x icons.
For more advanced scoring needs, you can provide your own custom scoring logic using Python or JavaScript code. The code execution environment is the same as the one used for the code execution evaluation column type (learn more).
This custom scoring logic can be used to generate a single score number or a drill-down matrix.
You can optionally return multiple drill-down matrices. This is useful for generating confusion matrices.
Your custom scoring code must return an object with the following keys:

- `score` (required): A number representing the overall score.
- `score_matrix` (optional): A list of lists of lists, representing one or more matrices of drilled-down scores. Each cell in these matrices can be a raw value or an object with metadata.

Each cell in the `score_matrix` can be either:

- A raw value (a string or number), for example `42`
- An object with metadata, for example `{"value": 42, "positive_metric": true}`, where:
  - `value` is the actual value of the cell, which can be a string or number.
  - `positive_metric` (optional) is a boolean indicating whether an increase in this value is considered positive (`true`). If absent, it defaults to `true`.
The optional `positive_metric` property can be used to indicate how changes in a value should be interpreted when comparing evaluations. This is particularly useful for automated reporting and analysis tools.
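Putting the pieces together, a minimal returned object might look like the sketch below (the numbers and labels are illustrative, not taken from a real report):

```python
# Smallest valid result: only the required overall score
{"score": 87.5}

# A result with one drill-down matrix; cells mix raw values and metadata objects
{
    "score": 87.5,
    "score_matrix": [
        [
            ["Outcome", "Count"],                                # header row
            ["Passed", {"value": 35, "positive_metric": True}],  # more passes is good
            ["Failed", {"value": 5, "positive_metric": False}],  # more failures is bad
        ]
    ],
}
```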
To add titles to your score matrices, simply add an extra field to the first row of the matrix and it will automatically be interpreted as the primary title. For example, if you have a matrix like:
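The original example isn't reproduced here, so the sketch below stands in for it; the column names and values are assumptions:

```python
score_matrix = [
    [
        ["Category", "Accuracy"],  # first row (headers), no title field yet
        ["Refunds", 0.92],
        ["Billing", 0.87],
    ]
]
```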
You can add a title by modifying it to:
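Continuing the same illustrative sketch, an extra leading field on the first row supplies the title:

```python
score_matrix = [
    [
        ["Accuracy by category", "Category", "Accuracy"],  # extra field = primary title
        ["Refunds", 0.92],
        ["Billing", 0.87],
    ]
]
```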
The `data` variable will be available in your scoring code; it is a list containing a dictionary for each row in the evaluation results. The keys in each dictionary correspond to the column names, and the values are the corresponding cell values.
For example:
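The snippet below is an illustrative Python sketch, not the canonical implementation: the column name "Correct?" is a hypothetical example, and only the `data` variable and the shape of the returned object come from the description above. How the finished object is handed back follows the same conventions as the code execution column type.

```python
# `data` is the list of row dictionaries provided to the scoring code.
# "Correct?" is an assumed boolean column name used for illustration.
correct = [bool(row.get("Correct?")) for row in data]
accuracy = 100 * sum(correct) / max(len(correct), 1)

# Object for the scoring code to return (see the required/optional keys above).
result = {
    "score": accuracy,
    "score_matrix": [
        [
            ["Run summary", "Metric", "Value"],  # extra leading field acts as the matrix title
            ["Accuracy (%)", {"value": round(accuracy, 1), "positive_metric": True}],
            ["Rows evaluated", len(data)],
        ]
    ],
}
```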
You can compare two evaluation reports to see how scores and other metrics have changed between runs. Simply click the “Compare” button and select the evaluation reports you want to compare.
The score card and any score matrices will be displayed side-by-side for easy comparison of your prompt’s performance over time.