Online or Programmatic Evals
PromptLayer offers powerful options for configuring and running evaluation pipelines programmatically in your workflows. This is ideal for users who require the flexibility to run evaluations from code, enabling seamless integration with existing CI/CD pipelines or custom automation scripts.
Recommended Workflow
We recommend the systematic approach outlined step by step below: create a dataset, build an evaluation pipeline, configure its steps, and trigger runs from your own tooling. This approach enables two powerful use cases:
1. Nightly Evaluations (Production Monitoring)
Run scheduled evaluations to ensure nothing has changed in your production system. The score can be sent to Slack or your alerting system with a direct link to the evaluation pipeline. This helps detect production issues by sampling a wide range of requests and comparing against expected performance.
2. CI/CD Integration
Trigger evaluations in your CI/CD pipeline (GitHub, GitLab, etc.) whenever relevant PRs are created. Wait for the evaluation score before proceeding with deployment to make sure your changes do not break anything.
Step-by-Step Implementation
Step 1: Create a Dataset
To run evaluations, you’ll need a dataset to test your prompts against. You can create datasets from your request history programmatically via the API, or directly on the dashboard by uploading a CSV file.
- Endpoint: `/dataset-from-filter-params`
- Description: Create a dataset in PromptLayer programmatically from your request history.
- Payload Filters: Include the required `name` parameter and `workspace_id`. Optionally define `start_time` and `end_time` to filter requests within a specific timeframe. Use `metadata` for key-value filtering, `prompt_template` for template-specific requests, and `tags` for additional categorization.
Example Payload
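A minimal sketch of what this request could look like in Python, assuming the standard PromptLayer REST base URL and `X-API-KEY` authentication header. Only the parameters listed above are documented here; the exact shape of the filter values (for example, how `metadata` is nested) may differ, so treat the payload structure as illustrative.

```python
import requests

API_KEY = "pl_..."  # your PromptLayer API key
BASE_URL = "https://api.promptlayer.com"  # assumed base URL

# `name` and `workspace_id` are required; the remaining filters are optional.
# The nesting of `metadata` and the response field names are assumptions.
payload = {
    "name": "production-sample-march",
    "workspace_id": 1234,
    "start_time": "2024-03-01T00:00:00Z",
    "end_time": "2024-03-31T23:59:59Z",
    "metadata": {"environment": "production"},
    "prompt_template": "customer-support-agent",
    "tags": ["nightly-eval"],
}

response = requests.post(
    f"{BASE_URL}/dataset-from-filter-params",
    headers={"X-API-KEY": API_KEY},
    json=payload,
)
response.raise_for_status()
dataset_id = response.json().get("id")  # response field name assumed
print(f"Created dataset: {dataset_id}")
```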
We will be updating the endpoints for datasets soon to enable:
- More intuitive dataset creation
- Uploading datasets from CSV files via API
- Enhanced filtering options
Step 2: Create an Evaluation Pipeline
We highly recommend building an evaluation pipeline via our dashboard. You can still create your evaluation pipeline (report) by making a POST request to `/reports` with a name and dataset ID, but this method is less user-friendly and does not provide the same level of configuration options as the dashboard.
Example Payload
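As a rough sketch, the report-creation call could look like the following; the `dataset_id` field name is an assumption based on the description above.

```python
import requests

API_KEY = "pl_..."
BASE_URL = "https://api.promptlayer.com"  # assumed base URL
dataset_id = 42  # ID returned when the dataset was created in Step 1

# Create the evaluation pipeline (report) against an existing dataset.
# Field names other than `name` are assumptions.
payload = {
    "name": "nightly-regression-eval",
    "dataset_id": dataset_id,
}

response = requests.post(
    f"{BASE_URL}/reports",
    headers={"X-API-KEY": API_KEY},
    json=payload,
)
response.raise_for_status()
report_id = response.json().get("id")  # response field name assumed
```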
Step 3: Configure Pipeline Steps
The evaluation pipeline consists of steps, each referred to as a “report column”. Configure these by making POST requests to `/report-columns` for each desired step.
Example: Prompt Template Step
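The exact report-column schema is not reproduced in this guide, so the `column_type` value and `configuration` keys in the sketch below are placeholders; refer to the `/report-columns` API reference for the real field names.

```python
import requests

API_KEY = "pl_..."
BASE_URL = "https://api.promptlayer.com"  # assumed base URL
report_id = 42  # ID returned when the report was created in Step 2

# Add a step that runs a prompt template against each dataset row.
# The `column_type` value and `configuration` keys are illustrative only.
payload = {
    "report_id": report_id,
    "name": "Run prompt template",
    "column_type": "PROMPT_TEMPLATE",
    "configuration": {
        "template_name": "customer-support-agent",
        "engine": "gpt-4o",
    },
}

response = requests.post(
    f"{BASE_URL}/report-columns",
    headers={"X-API-KEY": API_KEY},
    json=payload,
)
response.raise_for_status()
```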
Example: API Endpoint Step
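A similarly hypothetical sketch for a step that calls out to your own HTTP endpoint; the URL and configuration keys are placeholders to adapt to the real schema.

```python
import requests

API_KEY = "pl_..."
BASE_URL = "https://api.promptlayer.com"  # assumed base URL
report_id = 42  # ID returned when the report was created in Step 2

# Add a step that sends each row to your own API endpoint and records
# the response. The `column_type` value and configuration keys are
# illustrative only.
payload = {
    "report_id": report_id,
    "name": "Call scoring service",
    "column_type": "ENDPOINT",
    "configuration": {
        "url": "https://example.com/score",  # hypothetical scoring endpoint
    },
}

response = requests.post(
    f"{BASE_URL}/report-columns",
    headers={"X-API-KEY": API_KEY},
    json=payload,
)
response.raise_for_status()
```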
Step 4: Trigger the Evaluation
Once your pipeline is configured, trigger it programmatically using the run endpoint:
- Endpoint: `POST /reports/{report_id}/run`
- Description: Execute the evaluation pipeline, with an optional dataset refresh.
- Docs Link: Run Evaluation Pipeline
Example Payload
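A minimal sketch of triggering the run, assuming the same base URL and auth header as above. The name of the dataset-refresh flag is an assumption; check the run endpoint reference for the exact parameter.

```python
import requests

API_KEY = "pl_..."
BASE_URL = "https://api.promptlayer.com"  # assumed base URL
report_id = 42  # ID of the configured evaluation pipeline

# Kick off the evaluation. The flag controlling the optional dataset
# refresh is assumed here.
response = requests.post(
    f"{BASE_URL}/reports/{report_id}/run",
    headers={"X-API-KEY": API_KEY},
    json={"dataset_refresh": True},
)
response.raise_for_status()
```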
Step 5: Monitor and Retrieve Results
You have two options for monitoring evaluation progress:
Option A: Polling
Continuously check the report status until completion:
- Endpoint: `GET /reports/{report_id}`
- Description: Retrieve the status and results of a specific report by its ID.
- Docs Link: Get Report Status
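For example, a simple polling loop might look like this; the `status` field and its terminal values are assumptions about the response body.

```python
import time
import requests

API_KEY = "pl_..."
BASE_URL = "https://api.promptlayer.com"  # assumed base URL
report_id = 42

# Poll until the report reaches a terminal state. The `status` field and
# its possible values are assumptions; inspect a real response to confirm.
while True:
    response = requests.get(
        f"{BASE_URL}/reports/{report_id}",
        headers={"X-API-KEY": API_KEY},
    )
    response.raise_for_status()
    status = response.json().get("status")
    if status in ("COMPLETED", "FAILED"):
        break
    time.sleep(30)  # wait between checks to avoid hammering the API
```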
Option B: Webhooks
Listen for the `report_finished` webhook event for real-time notifications when evaluations complete.
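A minimal webhook receiver sketch using Flask; the event and report ID field names in the payload are assumptions, so adapt them to the actual webhook body you receive.

```python
from flask import Flask, request

app = Flask(__name__)

@app.route("/promptlayer-webhook", methods=["POST"])
def handle_webhook():
    event = request.get_json(force=True)
    # Field names below are assumptions about the webhook payload.
    if event.get("event_type") == "report_finished":
        report_id = event.get("report_id")
        print(f"Evaluation pipeline {report_id} finished")
        # Fetch the score here (see Step 6) or forward an alert to Slack.
    return "", 204

if __name__ == "__main__":
    app.run(port=8000)
```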
Step 6: Get the Score
Once the evaluation is complete, retrieve the final score:
- Endpoint: `GET /reports/{report_id}/score`
- Description: Fetch the score of a specific report by its ID.
- Docs Link: Get Evaluation Score
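Putting it together, a sketch of fetching the score, with the same assumed base URL and auth header as above:

```python
import requests

API_KEY = "pl_..."
BASE_URL = "https://api.promptlayer.com"  # assumed base URL
report_id = 42

response = requests.get(
    f"{BASE_URL}/reports/{report_id}/score",
    headers={"X-API-KEY": API_KEY},
)
response.raise_for_status()
print(response.json())  # response shape not documented here; inspect as needed
```

From here you can post the score to Slack or compare it against a threshold to gate a deployment, matching the two use cases described earlier.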