Custom Scorers allow you to create reusable scoring logic for your benchmarks, making it easier to evaluate agent performance consistently and efficiently.

Why Use Custom Scorers?

Custom Scorers provide several key benefits:
  1. Reusability: Create scoring patterns once and reuse them across multiple benchmarks and scenarios
  2. Consistency: Ensure consistent evaluation criteria across different scenarios even as your scoring logic becomes more complex
  3. Flexibility: Easily customize scoring behavior using context variables

Best Practices for Scoring

When creating custom scorers, follow these best practices:
  • Provide Partial Credit: Design scorers that give granular feedback rather than binary pass/fail scores. This makes differences in agent performance across runs easier to see and helps you improve your agent faster (see the sketch after this list).
  • Avoid Cold Start Scoring: Structure scoring to prevent scenarios where most runs result in a 0 score. This makes it easier to narrow down the root causes of an agent’s performance issues.
  • Use Context Variables: Leverage $RL_SCORER_CONTEXT to make scorers configurable and reusable.
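
For example, a scorer can map the gap between an expected value and the agent’s actual result onto a 0 to 1 range instead of returning only pass or fail. The Python sketch below is purely illustrative and independent of the Runloop API; the function name and inputs are hypothetical.

def partial_credit(expected: float, actual: float) -> float:
    """Illustrative linear partial credit: 1.0 on an exact match,
    falling toward 0.0 as the gap grows relative to the expected value."""
    if expected == 0:
        return 1.0 if actual == 0 else 0.0
    gap = abs(actual - expected)
    return max(0.0, 1.0 - gap / expected)

# A binary scorer would return only 0.0 or 1.0; the linear version
# distinguishes "close" runs from runs that are far off.
print(partial_credit(expected=10, actual=8))   # 0.8
print(partial_credit(expected=10, actual=25))  # 0.0 (clamped)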

Creating a Custom Scorer

Here’s an example of creating a custom scorer that evaluates the length of an agent’s response written to a file:
import os
from runloop_api_client import Runloop

client = Runloop(
    bearer_token=os.environ.get("RUNLOOP_API_KEY"),  # This is the default and can be omitted
)
scorer = client.scenarios.scorers.create(
    bash_script="""
    #!/bin/bash

    # Parse the scorer context to get the expected length and file path
    expected_length=$(echo "$RL_SCORER_CONTEXT" | jq -r '.expected_length')
    file_path=$(echo "$RL_SCORER_CONTEXT" | jq -r '.file_path')

    # Read the file contents
    file_contents=$(cat "$file_path")

    # Get the actual length by counting characters in file contents
    actual_length=$(echo -n "$file_contents" | wc -m)

    # Calculate the absolute difference between the actual and expected length
    diff=$(( actual_length > expected_length ? actual_length - expected_length : expected_length - actual_length ))
    
    # Calculate score based on difference (1.0 when equal, decreasing linearly as difference increases)
    # Use bc for floating point math
    score=$(echo "scale=2; 1.0 - ($diff / $expected_length)" | bc)
    
    # Ensure score doesn't go below 0
    if (( $(echo "$score < 0" | bc -l) )); then
        echo "0.0"
    else
        echo "$score"
    fi
    """,
    type="my_custom_scorer_type",
)
print(scorer.id)
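
At scoring time, the scorer_params you attach in the scenario’s scoring contract (shown in the next section) are made available to the script as JSON via the RL_SCORER_CONTEXT environment variable, which is why the script above parses expected_length and file_path with jq. If you want to sanity-check the script’s math before registering it, a local harness along the following lines can help. This is a hypothetical test setup, not part of the Runloop API; it assumes the script is saved locally as length_scorer.sh and that bash, jq, and bc are installed.

import json
import os
import subprocess
import tempfile
from pathlib import Path

# Hypothetical local copy of the bash script passed to
# client.scenarios.scorers.create(...) above.
bash_script = Path("length_scorer.sh").read_text()

# Write a 9-character test file; expected_length below is 10,
# so the script should print a score of .90.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("hello wor")
    test_path = f.name

context = json.dumps({"expected_length": 10, "file_path": test_path})
result = subprocess.run(
    ["bash", "-c", bash_script],
    env={**os.environ, "RL_SCORER_CONTEXT": context},
    capture_output=True,
    text=True,
)
print(result.stdout.strip())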

Using Custom Scorers in Scenarios

Once you’ve created a custom scorer, you can use it in your scenarios. Here’s an example that uses the scorer above to evaluate how closely the file written by the agent matches an expected length of 10 characters:
import os
from runloop_api_client import Runloop

client = Runloop(
    bearer_token=os.environ.get("RUNLOOP_API_KEY"),  # This is the default and can be omitted
)
scenario_view = client.scenarios.create(
    input_context={
        "problem_statement": "problem_statement"
    },
    name="name",
    scoring_contract={
        "scoring_function_parameters": [{
            "name": "my scorer",
            "scorer": {
              "type": "custom_scorer",
              "custom_scorer_type": "my_custom_scorer_type",
              "scorer_params": {
                "expected_length": 10,
                "file_path": "/home/user/file.txt"
              }
            },
            "weight": 1.0,
        }]
    },
)
print(scenario_view.id)
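
A scoring contract is not limited to a single scoring function. You can list several entries in scoring_function_parameters and use their weight values to control how much each one contributes to the overall score. The example below is a sketch: it assumes weights express each scorer’s relative contribution, and the second scorer type, my_other_scorer_type, is a hypothetical custom scorer registered the same way as above.

import os
from runloop_api_client import Runloop

client = Runloop(
    bearer_token=os.environ.get("RUNLOOP_API_KEY"),
)
scenario_view = client.scenarios.create(
    input_context={
        "problem_statement": "problem_statement"
    },
    name="name",
    scoring_contract={
        "scoring_function_parameters": [
            {
                "name": "length check",
                "scorer": {
                    "type": "custom_scorer",
                    "custom_scorer_type": "my_custom_scorer_type",
                    "scorer_params": {
                        "expected_length": 10,
                        "file_path": "/home/user/file.txt"
                    }
                },
                "weight": 0.7,
            },
            {
                "name": "other check",
                "scorer": {
                    "type": "custom_scorer",
                    # Hypothetical second custom scorer type
                    "custom_scorer_type": "my_other_scorer_type",
                    "scorer_params": {
                        "file_path": "/home/user/file.txt"
                    }
                },
                "weight": 0.3,
            },
        ]
    },
)
print(scenario_view.id)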