Recommended Workflow
We recommend a systematic approach to implementing automated evaluations. This approach enables two powerful use cases:

1. Nightly Evaluations (Production Monitoring)

Run scheduled evaluations to ensure nothing has changed in your production system. The score can be sent to Slack or your alerting system with a direct link to the evaluation pipeline. This helps detect production issues by sampling a wide range of requests and comparing against expected performance.

2. CI/CD Integration

Trigger evaluations in your CI/CD pipeline (GitHub, GitLab, etc.) whenever relevant PRs are created. Wait for the evaluation score before proceeding with deployment to make sure that your changes do not break anything.

Step-by-Step Implementation
Step 1: Create a Dataset
To run evaluations, you’ll need a dataset against which to test your prompts. PromptLayer now provides a comprehensive set of APIs for dataset management:

1.1 Create a Dataset Group

First, create a dataset group to organize your datasets:

- Endpoint: POST /api/public/v2/dataset-groups
- Description: Create a new dataset group within a workspace
- Authentication: JWT or API key
- Docs Link: Create Dataset Group
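For example, here is a minimal sketch in Python using the requests library. The base URL, the X-API-KEY header, and the payload fields (name, workspace_id) are assumptions for illustration; check the Create Dataset Group docs for the exact schema.

```python
import requests

# Assumed base URL and API-key header; verify against the PromptLayer docs.
BASE_URL = "https://api.promptlayer.com"
HEADERS = {"X-API-KEY": "<your_api_key>"}

# Illustrative payload fields -- the exact schema is in the Create Dataset Group reference.
resp = requests.post(
    f"{BASE_URL}/api/public/v2/dataset-groups",
    headers=HEADERS,
    json={"name": "nightly-eval-datasets", "workspace_id": 1},
)
resp.raise_for_status()
dataset_group_id = resp.json()["id"]  # assumed response field
```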
1.2 Create a Dataset Version
Once you have a dataset group, you can create dataset versions using one of two methods:

Option A: From Request History

Create a dataset from your existing request logs:

- Endpoint: POST /api/public/v2/dataset-versions/from-filter-params
- Description: Create a dataset version by filtering request logs
- Authentication: API key only
- Docs Link: Create Dataset Version from Filter Params
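A hedged sketch of this option follows. The filter fields shown (dataset_group_id, limit, metadata) are illustrative only; consult the Create Dataset Version from Filter Params docs for the real field names and supported filters.

```python
import requests

BASE_URL = "https://api.promptlayer.com"      # assumed base URL
HEADERS = {"X-API-KEY": "<your_api_key>"}

# Build a dataset version by sampling production request logs.
# Field names below are assumptions, not the documented schema.
resp = requests.post(
    f"{BASE_URL}/api/public/v2/dataset-versions/from-filter-params",
    headers=HEADERS,
    json={
        "dataset_group_id": 123,                    # group created in step 1.1
        "limit": 100,                               # sample size of request logs (assumed)
        "metadata": {"environment": "production"},  # example filter (assumed)
    },
)
resp.raise_for_status()
```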
Option B: From File Upload
Upload a CSV or JSON file to create a dataset:

- Endpoint: POST /api/public/v2/dataset-versions/from-file
- Description: Create a dataset version by uploading a file
- Authentication: API key only
- Docs Link: Create Dataset Version from File
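A minimal sketch of the file-upload option, assuming a multipart upload with a file field and a dataset_group_id parameter; the exact field names are in the linked docs.

```python
import requests

BASE_URL = "https://api.promptlayer.com"      # assumed base URL
HEADERS = {"X-API-KEY": "<your_api_key>"}

# Upload a local CSV as a new dataset version. The multipart field name and the
# dataset_group_id parameter are assumptions -- see Create Dataset Version from File.
with open("eval_cases.csv", "rb") as f:
    resp = requests.post(
        f"{BASE_URL}/api/public/v2/dataset-versions/from-file",
        headers=HEADERS,
        data={"dataset_group_id": 123},
        files={"file": ("eval_cases.csv", f, "text/csv")},
    )
resp.raise_for_status()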
Step 2: Create an Evaluation Pipeline
We highly recommend building an evaluation pipeline via our dashboard. You can still create your evaluation pipeline (report) programmatically by making a POST request to /reports with a name and dataset ID, but this method is less user-friendly and does not provide the same level of configuration options as the dashboard.
Example Payload
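A minimal sketch of the request, assuming name and dataset_id are the required fields; confirm the exact schema in the /reports reference.

```python
import requests

BASE_URL = "https://api.promptlayer.com"      # assumed base URL
HEADERS = {"X-API-KEY": "<your_api_key>"}

# Field names are illustrative; the /reports reference has the exact schema.
resp = requests.post(
    f"{BASE_URL}/reports",
    headers=HEADERS,
    json={
        "name": "nightly-production-eval",
        "dataset_id": 456,  # dataset version created in Step 1
    },
)
resp.raise_for_status()
report_id = resp.json()["id"]  # assumed response field
```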
Step 3: Configure Pipeline Steps
The evaluation pipeline consists of steps, each referred to as a “report column”. Configure these by making POST requests to /report-columns for each desired step.
Example: Prompt Template Step
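A hedged sketch of a prompt-template column: it runs a registered prompt template against each dataset row. The column_type value and configuration keys shown are placeholders; the /report-columns reference lists the supported values.

```python
import requests

BASE_URL = "https://api.promptlayer.com"      # assumed base URL
HEADERS = {"X-API-KEY": "<your_api_key>"}

# Illustrative payload: type identifier and configuration keys are assumptions.
resp = requests.post(
    f"{BASE_URL}/report-columns",
    headers=HEADERS,
    json={
        "report_id": 789,
        "name": "run-prompt",
        "column_type": "PROMPT_TEMPLATE",                      # assumed type identifier
        "configuration": {
            "template_name": "customer-support-bot",           # assumed key
            "variable_mappings": {"question": "input"},        # assumed key
        },
    },
)
resp.raise_for_status()
```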
Example: API Endpoint Step
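A similar sketch for an API endpoint column, which calls an external HTTP endpoint (for example, your own application) for each dataset row. Again, the type identifier and configuration keys are assumptions, and the URL is purely hypothetical.

```python
import requests

BASE_URL = "https://api.promptlayer.com"      # assumed base URL
HEADERS = {"X-API-KEY": "<your_api_key>"}

# Illustrative payload: type identifier and configuration keys are assumptions.
resp = requests.post(
    f"{BASE_URL}/report-columns",
    headers=HEADERS,
    json={
        "report_id": 789,
        "name": "call-production-api",
        "column_type": "ENDPOINT",                             # assumed type identifier
        "configuration": {
            "url": "https://your-app.example.com/answer",      # hypothetical endpoint
        },
    },
)
resp.raise_for_status()
```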
Step 4: Trigger the Evaluation
Once your pipeline is configured, trigger it programmatically using the run endpoint:

- Endpoint: POST /reports/{report_id}/run
- Description: Execute the evaluation pipeline with optional dataset refresh
- Docs Link: Run Evaluation Pipeline
Example Payload
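A minimal sketch of the run request. The dataset_refresh field mirrors the optional dataset refresh mentioned above, but the exact field name is an assumption; see Run Evaluation Pipeline for the documented payload.

```python
import requests

BASE_URL = "https://api.promptlayer.com"      # assumed base URL
HEADERS = {"X-API-KEY": "<your_api_key>"}

report_id = 789

# "dataset_refresh" is an assumed field name for the optional dataset refresh.
resp = requests.post(
    f"{BASE_URL}/reports/{report_id}/run",
    headers=HEADERS,
    json={"dataset_refresh": True},
)
resp.raise_for_status()
```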
Step 5: Monitor and Retrieve Results
You have two options for monitoring evaluation progress:

Option A: Polling

Continuously check the report status until completion:

- Endpoint: GET /reports/{report_id}
- Description: Retrieve the status and results of a specific report by its ID.
- Docs Link: Get Report Status
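For example, a simple polling loop in Python. The status field and its terminal values are assumptions; check Get Report Status for the real response shape.

```python
import time
import requests

BASE_URL = "https://api.promptlayer.com"      # assumed base URL
HEADERS = {"X-API-KEY": "<your_api_key>"}

report_id = 789

# Poll until the report reaches a terminal state. The "status" field and its
# values below are assumptions -- verify against the Get Report Status docs.
while True:
    resp = requests.get(f"{BASE_URL}/reports/{report_id}", headers=HEADERS)
    resp.raise_for_status()
    status = resp.json().get("status")
    if status in ("COMPLETED", "FAILED"):
        break
    time.sleep(30)  # wait between checks to avoid hammering the API
```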
Option B: Webhooks
Listen for the report_finished webhook event for real-time notifications when evaluations complete.
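A minimal sketch of a webhook receiver, using Flask purely for illustration and assuming the event payload carries an event name and a report_id; the actual payload shape may differ.

```python
from flask import Flask, request

app = Flask(__name__)

@app.route("/promptlayer-webhook", methods=["POST"])
def handle_webhook():
    payload = request.get_json(force=True)
    # Assumed payload shape: an event name plus the finished report's ID.
    if payload.get("event") == "report_finished":
        report_id = payload.get("report_id")
        print(f"Evaluation {report_id} finished -- fetch the score next.")
    return "", 204
```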
Step 6: Get the Score
Once the evaluation is complete, retrieve the final score:

- Endpoint: GET /reports/{report_id}/score
- Description: Fetch the score of a specific report by its ID.
- Docs Link: Get Evaluation Score
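A minimal sketch of fetching the score; the response field name is an assumption, so confirm it in the Get Evaluation Score reference.

```python
import requests

BASE_URL = "https://api.promptlayer.com"      # assumed base URL
HEADERS = {"X-API-KEY": "<your_api_key>"}

report_id = 789

resp = requests.get(f"{BASE_URL}/reports/{report_id}/score", headers=HEADERS)
resp.raise_for_status()
score = resp.json().get("score")  # assumed response field
print(f"Evaluation score: {score}")
```

From here, you can forward the score to Slack for nightly monitoring or fail a CI job when it drops below a threshold, as described in the use cases above.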