PromptLayer offers powerful options for configuring and running evaluation pipelines programmatically in your workflows. This is ideal for users who require the flexibility to run evaluations from code, enabling seamless integration with existing CI/CD pipelines or custom automation scripts.

We recommend a systematic approach to implementing automated evaluations:

1. Create a dataset
2. Build an evaluation pipeline
3. Configure the pipeline steps
4. Trigger the evaluation
5. Monitor progress, either by polling the report status or by listening for the report_finished webhook event
6. Once the report is complete, get the score
7. Send alerts or take action based on the results

This approach enables two powerful use cases:

1. Nightly Evaluations (Production Monitoring)

Run scheduled evaluations to verify that your production system's behavior has not regressed. The score can be sent to Slack or your alerting system along with a direct link to the evaluation pipeline. This helps detect production issues by sampling a wide range of requests and comparing results against expected performance.

2. CI/CD Integration

Trigger evaluations from your CI/CD pipeline (GitHub, GitLab, etc.) whenever relevant PRs are created, and wait for the evaluation score before proceeding with deployment to confirm that your changes do not introduce regressions.
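
As a concrete illustration of this gating pattern, here is a minimal end-to-end sketch that triggers a run, waits for it to finish, and fails the build when the score is low. It condenses Steps 4 through 6 below and assumes the REST API lives at https://api.promptlayer.com with an X-API-KEY header; the threshold and environment variable names are arbitrary examples.

# ci_eval_gate.py: block deployment when the evaluation score drops.
# Sketch only; confirm the base URL and auth header against the API reference.
import os
import sys
import time

import requests

BASE_URL = "https://api.promptlayer.com"
HEADERS = {"X-API-KEY": os.environ["PROMPTLAYER_API_KEY"]}
REPORT_ID = int(os.environ["EVAL_REPORT_ID"])  # evaluation pipeline built beforehand
SCORE_THRESHOLD = 85.0                         # arbitrary example threshold

# Step 4: trigger the evaluation run.
requests.post(
    f"{BASE_URL}/reports/{REPORT_ID}/run",
    headers=HEADERS,
    json={"name": "CI evaluation run"},
).raise_for_status()

# Step 5 (Option A): poll until the report completes.
while True:
    report = requests.get(f"{BASE_URL}/reports/{REPORT_ID}", headers=HEADERS).json()
    if report.get("status") == "COMPLETED":
        break
    time.sleep(30)

# Step 6: fetch the score and gate the deployment via the exit code.
score = requests.get(f"{BASE_URL}/reports/{REPORT_ID}/score", headers=HEADERS).json()
overall = score["score"]["overall_score"]
print(f"Evaluation score: {overall}")
sys.exit(0 if overall >= SCORE_THRESHOLD else 1)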

Step-by-Step Implementation

Step 1: Create a Dataset

To run evaluations, you’ll need a dataset to test your prompts against. You can create datasets programmatically from your request history via the API, or directly on the dashboard by uploading a CSV file.

  • Endpoint: /dataset-from-filter-params
  • Description: Create a dataset in PromptLayer programmatically from your request history.
  • Payload Filters: Include the required name parameter and workspace_id. Optionally define start_time and end_time to filter requests within a specific timeframe. Use metadata for key-value filtering, prompt_template for template-specific requests, and tag filters (such as tags_and, shown in the example payload below) for additional categorization.

Example Payload

{
    "name": "production-sample-dataset",
    "workspace_id": 32,
    "tags_and": ["prod"],
    "metadata": [
        {
            "key": "environment",
            "value": "production"
        }
    ],
    "prompt_template": {
        "name": "my_prompt_template",
        "version_numbers": [11, 12]
    },
    "limit": 100
}
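
If you are scripting this step, the request is a single POST. Below is a minimal sketch in Python; it assumes the REST API is served from https://api.promptlayer.com and authenticated with an X-API-KEY header, so confirm both (and the response shape) against the API reference.

import os

import requests

BASE_URL = "https://api.promptlayer.com"  # assumed base URL
HEADERS = {"X-API-KEY": os.environ["PROMPTLAYER_API_KEY"]}

payload = {
    "name": "production-sample-dataset",
    "workspace_id": 32,
    "tags_and": ["prod"],
    "metadata": [{"key": "environment", "value": "production"}],
    "prompt_template": {"name": "my_prompt_template", "version_numbers": [11, 12]},
    "limit": 100,
}

# Create the dataset from filtered request history.
resp = requests.post(f"{BASE_URL}/dataset-from-filter-params", headers=HEADERS, json=payload)
resp.raise_for_status()
print(resp.json())  # note the new dataset's ID for use in later steps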

We will be updating the endpoints for datasets soon to enable:

  • More intuitive dataset creation
  • Uploading datasets from CSV files via API
  • Enhanced filtering options

Step 2: Create an Evaluation Pipeline

We highly recommend building your evaluation pipeline via the dashboard. You can still create an evaluation pipeline (report) by making a POST request to /reports with a name and a dataset ID, but this method is less user-friendly and does not provide the same level of configuration options as the dashboard.

Example Payload

{
    "name": "Nightly Production Evaluation",
    "test_dataset_id": 123
}
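
If you do create the pipeline from code, the call looks like the sketch below, under the same base URL and authentication assumptions as before.

import os

import requests

BASE_URL = "https://api.promptlayer.com"  # assumed base URL
HEADERS = {"X-API-KEY": os.environ["PROMPTLAYER_API_KEY"]}

# Create the evaluation pipeline (report) against an existing dataset.
resp = requests.post(
    f"{BASE_URL}/reports",
    headers=HEADERS,
    json={"name": "Nightly Production Evaluation", "test_dataset_id": 123},
)
resp.raise_for_status()
print(resp.json())  # note the new report's ID; it is needed for Steps 3-6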

Step 3: Configure Pipeline Steps

The evaluation pipeline consists of steps, each referred to as a “report column”. Configure these by making POST requests to /report-columns for each desired step.

Example: Prompt Template Step

{
    "column_type": "PROMPT_TEMPLATE",
    "name": "prompt response",
    "configuration": {
        "engine": {
            "provider": "openai",
            "model": "gpt-4",
            "parameters": {
                "temperature": 0.1,
                "max_tokens": 512
            }
        },
        "template": {
            "name": "my_prompt_template",
            "version_number": null
        },
        "prompt_template_variable_mappings": {
            "query": "variable.query",
            "context": "variable.context"
        }
    },
    "report_id": report_id
}

Example: API Endpoint Step

{
    "column_type": "ENDPOINT",
    "name": "RAG-Enhancement",
    "configuration": {
        "url": "https://api.example.com/enhance"
    },
    "report_id": report_id
}
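
From code, you would typically loop over your column definitions and POST each one to /report-columns. Here is a sketch under the same base URL and authentication assumptions, where 456 stands in for the report ID from Step 2.

import os

import requests

BASE_URL = "https://api.promptlayer.com"  # assumed base URL
HEADERS = {"X-API-KEY": os.environ["PROMPTLAYER_API_KEY"]}
report_id = 456  # the report created in Step 2

# Each dict is one report column; the payloads shown above can be dropped in
# here verbatim (with report_id filled in).
columns = [
    {
        "column_type": "ENDPOINT",
        "name": "RAG-Enhancement",
        "configuration": {"url": "https://api.example.com/enhance"},
        "report_id": report_id,
    },
    # ... add the PROMPT_TEMPLATE column payload shown above ...
]

for column in columns:
    resp = requests.post(f"{BASE_URL}/report-columns", headers=HEADERS, json=column)
    resp.raise_for_status()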

Step 4: Trigger the Evaluation

Once your pipeline is configured, trigger it programmatically using the run endpoint:

  • Endpoint: POST /reports/{report_id}/run
  • Description: Execute the evaluation pipeline with optional dataset refresh
  • Docs Link: Run Evaluation Pipeline

Example Payload

{
    "name": "Nightly Eval - 2024-12-15",
    "dataset_id": 123
}
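
A sketch of triggering the run from Python, under the same base URL and authentication assumptions; passing dataset_id lets the run pick up a freshly created dataset.

import os
from datetime import date

import requests

BASE_URL = "https://api.promptlayer.com"  # assumed base URL
HEADERS = {"X-API-KEY": os.environ["PROMPTLAYER_API_KEY"]}
report_id = 456  # the report created in Step 2

# Kick off the evaluation run, pointing it at dataset 123.
resp = requests.post(
    f"{BASE_URL}/reports/{report_id}/run",
    headers=HEADERS,
    json={"name": f"Nightly Eval - {date.today().isoformat()}", "dataset_id": 123},
)
resp.raise_for_status()
print(resp.json())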

Step 5: Monitor and Retrieve Results

You have two options for monitoring evaluation progress:

Option A: Polling

Continuously check the report status until completion:

  • Endpoint: GET /reports/{report_id}
  • Description: Retrieve the status and results of a specific report by its ID.
  • Docs Link: Get Report Status

Example Response

The status field reads "RUNNING" while the evaluation is in progress and "COMPLETED" once it finishes:

{
    "success": true,
    "report": {...},
    "status": "COMPLETED",
    "stats": {
        "status_counts": {
            "COMPLETED": 95,
            "FAILED": 3,
            "QUEUED": 0,
            "RUNNING": 2
        }
    }
}
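
A simple polling helper might look like the sketch below; the 30-second interval and one-hour timeout are arbitrary examples, and the base URL and auth header are the same assumptions as earlier.

import os
import time

import requests

BASE_URL = "https://api.promptlayer.com"  # assumed base URL
HEADERS = {"X-API-KEY": os.environ["PROMPTLAYER_API_KEY"]}


def wait_for_report(report_id, poll_seconds=30, timeout_seconds=3600):
    """Poll GET /reports/{report_id} until its status is COMPLETED."""
    deadline = time.time() + timeout_seconds
    while time.time() < deadline:
        resp = requests.get(f"{BASE_URL}/reports/{report_id}", headers=HEADERS)
        resp.raise_for_status()
        body = resp.json()
        if body.get("status") == "COMPLETED":
            return body
        time.sleep(poll_seconds)
    raise TimeoutError(f"Report {report_id} did not complete in {timeout_seconds}s")


report = wait_for_report(456)
print(report["stats"]["status_counts"])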

Option B: Webhooks

Listen for the report_finished webhook event for real-time notifications when evaluations complete.
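
A minimal receiver for that event could look like the following sketch. Flask is used purely as an example web framework, and the payload field names other than the event name are assumptions, so log a real delivery before relying on them.

# Hypothetical webhook receiver; field names besides "report_finished" are assumptions.
from flask import Flask, request

app = Flask(__name__)


@app.route("/promptlayer-webhook", methods=["POST"])
def promptlayer_webhook():
    event = request.get_json(force=True) or {}
    if event.get("event_type") == "report_finished":  # assumed field name
        report_id = event.get("report_id")            # assumed field name
        print(f"Report {report_id} finished; fetch its score (Step 6).")
    return "", 204


if __name__ == "__main__":
    app.run(port=8000)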

Step 6: Get the Score

Once the evaluation is complete, retrieve the final score:

  • Endpoint: GET /reports/{report_id}/score
  • Description: Fetch the score of a specific report by its ID.
  • Docs Link: Get Evaluation Score

Example Response

{
    "success": true,
    "message": "success",
    "score": {
        "overall_score": 87.5,
        "score_type": "multi_column",
        "has_custom_scoring": false,
        "details": {
            "columns": [
                {
                    "column_name": "Accuracy Check",
                    "score": 90.0,
                    "score_type": "boolean"
                },
                {
                    "column_name": "Safety Check",
                    "score": 85.0,
                    "score_type": "boolean"
                }
            ]
        }
    }
}
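
Tying this back to the nightly-monitoring use case, the sketch below fetches the score and posts an alert to a Slack incoming webhook when it falls below a threshold. The webhook URL, the threshold, and the message wording are illustrative; the score fields match the example response above.

import os

import requests

BASE_URL = "https://api.promptlayer.com"  # assumed base URL
HEADERS = {"X-API-KEY": os.environ["PROMPTLAYER_API_KEY"]}
SLACK_WEBHOOK_URL = os.environ["SLACK_WEBHOOK_URL"]  # example alerting channel
SCORE_THRESHOLD = 85.0                               # arbitrary example
report_id = 456

# Fetch the final score for the completed report.
resp = requests.get(f"{BASE_URL}/reports/{report_id}/score", headers=HEADERS)
resp.raise_for_status()
score = resp.json()["score"]

overall = score["overall_score"]
per_column = {c["column_name"]: c["score"] for c in score["details"]["columns"]}
print(f"Overall score: {overall}", per_column)

if overall < SCORE_THRESHOLD:
    # Alert the team; include a link to the report in your dashboard so the
    # failing rows are one click away.
    requests.post(
        SLACK_WEBHOOK_URL,
        json={"text": f"Evaluation report {report_id} scored {overall} "
                      f"(threshold {SCORE_THRESHOLD}). Please investigate."},
    )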