Custom Scorers allow you to create reusable scoring logic for your benchmarks, making it easier to evaluate agent performance consistently and efficiently.

Why Use Custom Scorers?

Custom Scorers provide several key benefits:
  1. Reusability: Create scoring patterns once and reuse them across multiple benchmarks and scenarios
  2. Consistency: Ensure consistent evaluation criteria across different scenarios even as your scoring logic becomes more complex
  3. Flexibility: Easily customize scoring behavior using context variables

Best Practices for Scoring

When creating custom scorers, follow these best practices:
  • Provide Partial Credit: Design scorers that give granular feedback rather than binary pass/fail scores. This makes differences in agent performance across runs easier to see and helps you improve your agent faster (see the sketch after this list).
  • Avoid Cold Start Scoring: Structure scoring to prevent scenarios where most runs result in a 0 score. This makes it easier to narrow down the root causes of an agent’s performance issues.
  • Use Context Variables: Leverage $RL_SCORER_CONTEXT to make scorers configurable and reusable.
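
For example, a scorer can map the gap between an expected value and the agent’s actual result onto a 0 to 1 range instead of returning only pass or fail. The Python sketch below is purely illustrative and independent of the Runloop API; the function name and inputs are hypothetical.

def partial_credit(expected: float, actual: float) -> float:
    """Illustrative linear partial credit: 1.0 on an exact match,
    falling toward 0.0 as the gap grows relative to the expected value."""
    if expected == 0:
        return 1.0 if actual == 0 else 0.0
    gap = abs(actual - expected)
    return max(0.0, 1.0 - gap / expected)

# A binary scorer would return only 0.0 or 1.0; the linear version
# distinguishes "close" runs from runs that are far off.
print(partial_credit(expected=10, actual=8))   # 0.8
print(partial_credit(expected=10, actual=25))  # 0.0 (clamped)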

Creating a Custom Scorer

Here’s an example of creating a custom scorer that evaluates the length of an agent’s response written to a file:
import os
from runloop_api_client import Runloop

client = Runloop(
    bearer_token=os.environ.get("RUNLOOP_API_KEY"),  # This is the default and can be omitted
)
scorer = client.scenarios.scorers.create(
    bash_script="""
    #!/bin/bash

    # Parse the scorer context to get the expected length and file path
    expected_length=$(echo "$RL_SCORER_CONTEXT" | jq -r '.expected_length')
    file_path=$(echo "$RL_SCORER_CONTEXT" | jq -r '.file_path')

    # Read the file contents
    file_contents=$(cat "$file_path")

    # Get the actual length by counting characters in file contents
    actual_length=$(echo -n "$file_contents" | wc -m)

    # Calculate the absolute difference between the actual and expected length
    diff=$(( actual_length > expected_length ? actual_length - expected_length : expected_length - actual_length ))
    
    # Calculate score based on difference (1.0 when equal, decreasing linearly as difference increases)
    # Use bc for floating point math
    score=$(echo "scale=2; 1.0 - ($diff / $expected_length)" | bc)
    
    # Ensure score doesn't go below 0
    if (( $(echo "$score < 0" | bc -l) )); then
        echo "0.0"
    else
        echo "$score"
    fi
    """,
    type="my_custom_scorer_type",
)
print(scorer.id)
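
At scoring time, the scorer_params you attach in the scenario’s scoring contract (shown in the next section) are made available to the script as JSON via the RL_SCORER_CONTEXT environment variable, which is why the script above parses expected_length and file_path with jq. If you want to sanity-check the script’s math before registering it, a local harness along the following lines can help. This is a hypothetical test setup, not part of the Runloop API; it assumes the script is saved locally as length_scorer.sh and that bash, jq, and bc are installed.

import json
import os
import subprocess
import tempfile
from pathlib import Path

# Hypothetical local copy of the bash script passed to
# client.scenarios.scorers.create(...) above.
bash_script = Path("length_scorer.sh").read_text()

# Write a 9-character test file; expected_length below is 10,
# so the script should print a score of .90.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("hello wor")
    test_path = f.name

context = json.dumps({"expected_length": 10, "file_path": test_path})
result = subprocess.run(
    ["bash", "-c", bash_script],
    env={**os.environ, "RL_SCORER_CONTEXT": context},
    capture_output=True,
    text=True,
)
print(result.stdout.strip())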

Using Custom Scorers in Scenarios

Once you’ve created a custom scorer, you can use it in your scenarios. Here’s an example that uses the scorer above to evaluate how closely the file written by the agent matches an expected length of 10 characters:
import os
from runloop_api_client import Runloop

client = Runloop(
    bearer_token=os.environ.get("RUNLOOP_API_KEY"),  # This is the default and can be omitted
)
scenario_view = client.scenarios.create(
    input_context={
        "problem_statement": "problem_statement"
    },
    name="name",
    scoring_contract={
        "scoring_function_parameters": [{
            "name": "my scorer",
            "scorer": {
              "type": "custom_scorer",
              "custom_scorer_type": "my_custom_scorer_type",
              "scorer_params": {
                "expected_length": 10,
                "file_path": "/home/user/file.txt"
              }
            },
            "weight": 1.0,
        }]
    },
)
print(scenario_view.id)
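
A scoring contract is not limited to a single scoring function. You can list several entries in scoring_function_parameters and use their weight values to control how much each one contributes to the overall score. The example below is a sketch: it assumes weights express each scorer’s relative contribution, and the second scorer type, my_other_scorer_type, is a hypothetical custom scorer registered the same way as above.

import os
from runloop_api_client import Runloop

client = Runloop(
    bearer_token=os.environ.get("RUNLOOP_API_KEY"),
)
scenario_view = client.scenarios.create(
    input_context={
        "problem_statement": "problem_statement"
    },
    name="name",
    scoring_contract={
        "scoring_function_parameters": [
            {
                "name": "length check",
                "scorer": {
                    "type": "custom_scorer",
                    "custom_scorer_type": "my_custom_scorer_type",
                    "scorer_params": {
                        "expected_length": 10,
                        "file_path": "/home/user/file.txt"
                    }
                },
                "weight": 0.7,
            },
            {
                "name": "other check",
                "scorer": {
                    "type": "custom_scorer",
                    # Hypothetical second custom scorer type
                    "custom_scorer_type": "my_other_scorer_type",
                    "scorer_params": {
                        "file_path": "/home/user/file.txt"
                    }
                },
                "weight": 0.3,
            },
        ]
    },
)
print(scenario_view.id)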