Start with the Runloop Quickstart to use the examples below.

Overview

Runloop Benchmarks are collections of one or more scenarios. A scenario is a single, self-contained test case where an agent is given a problem and is expected to modify a target environment to solve it. Once created, scenarios can be run as many times as you want with different agents, parameters and configurations.

Creating Custom Scenarios

Creating custom scenarios allows you to tailor problem statements and environments to your specific needs. This is useful for testing or training agents under controlled conditions or for building unique challenges. To define your own scenario:
  1. Create a Devbox image for running your scenario by either building a Blueprint (e.g., from a Dockerfile) or snapshotting an existing Devbox.
  2. Define a scoring function to evaluate the outcome of the scenario. The scoring function must return a score between 0 (fail) and 1 (pass).
  3. Create a problem statement that describes the task the agent must complete.
  4. Configure a reference_output; this is a known good output that the agent must achieve, sometimes referred to as the “gold patch” or “canonical solution”.
  5. Create a scenario using the blueprint, problem statement, environment parameters and scoring function.
Example:
# Launch a Devbox from an existing Blueprint and snapshot its disk to
# capture the scenario's starting environment.
devbox = await runloop.devbox.create(blueprint_name="bpt_123")
my_snapshot = await devbox.snapshot_disk(
  name="div incorrectly centered in flexbox"
)

my_new_scenario = await runloop.api.scenarios.create(
  name="My New Scenario",
  # The problem statement is the task presented to the agent.
  input_context={"problem_statement": "Create a UI component"},
  # The snapshot defines the environment the agent starts from.
  environment_parameters={"snapshot_id": my_snapshot.id},
  scoring_contract={
    "scoring_function_parameters": [{
      "name": "bash_scorer",
      "scorer": {
        "type": "bash_script_scorer",
        # Placeholder scorer; replace with a script that grades the solution.
        "bash_script": "echo 0.0"
      },
      "weight": 1.0
    }]
  },
  # Known-good solution for this scenario (the "gold patch").
  reference_output="echo 1.0"
)

Understanding Scoring Functions

Scoring functions are standalone scripts that validate whether a scenario was successfully completed. They grade the agent's solution for correctness and assign a score, which is captured by Runloop and used to evaluate the overall performance of a benchmark.

Basic Scoring Function Example

A simple scoring function is a bash script that echoes a score between 0 (failure) and 1 (success):
scoring_function_parameters = [{
  "name": "bash_scorer",
  "scorer": {
    "type": "bash_script_scorer",
    "bash_script": "echo 0.0"
  },
  "weight": 1.0
}]
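In practice, the script usually runs a check against the agent's work and converts the result into a score. The sketch below is illustrative only: the scorer name, project path, and test command are placeholder assumptions rather than values from an existing scenario:
scoring_function_parameters = [{
  "name": "pytest_pass_fail",  # hypothetical name
  "scorer": {
    "type": "bash_script_scorer",
    # Run the project's tests inside the Devbox and echo 1.0 if they pass, 0.0 otherwise.
    "bash_script": "cd /home/user/project && python -m pytest -q > /dev/null 2>&1 && echo 1.0 || echo 0.0"
  },
  "weight": 1.0
}]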

Custom Scoring Functions

To make scoring more reusable and flexible, you can define custom scoring functions. These are used to evaluate performance in specific ways, such as running tests or analyzing output logs. Example:
my_custom_scenario = await runloop.api.scenarios.create(
  name="scenario with custom scorer",
  input_context={"problem_statement": "Create a UI component"},
  # Reuse the snapshot from the scenario created above.
  environment_parameters={"snapshot_id": my_new_scenario.environment_parameters["snapshot_id"]},
  scoring_contract={
    "scoring_function_parameters": [{
      "name": "my-custom-pytest-script",
      "scorer": {
        # Reference a previously defined custom scorer by name.
        "type": "custom_scorer",
        "custom_scorer_type": "my-custom-pytest-script",
        # Parameters passed to the scorer for this particular scenario.
        "scorer_params": {"relevant_tests": ["foo.test.py", "bar.test.py"]}
      },
      "weight": 1.0
    }]
  }
)
Note that many scenarios will use the same scoring function with different parameters, depending on the test case.
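For instance, a second scenario can point the same custom scorer at a different set of tests. The scenario name, problem statement, and test file below are illustrative placeholders:
another_scenario = await runloop.api.scenarios.create(
  name="scenario reusing the same scorer",
  input_context={"problem_statement": "Fix the failing date parser"},
  environment_parameters={"snapshot_id": my_snapshot.id},
  scoring_contract={
    "scoring_function_parameters": [{
      "name": "my-custom-pytest-script",
      "scorer": {
        "type": "custom_scorer",
        "custom_scorer_type": "my-custom-pytest-script",
        # Same scorer as before, parameterized for this test case.
        "scorer_params": {"relevant_tests": ["date_parser.test.py"]}
      },
      "weight": 1.0
    }]
  }
)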

Custom Benchmarks

Once you have your scenarios and scoring functions defined, you can run all of your custom scenarios as a custom benchmark. You’ll need to create the benchmark instance first, then run it. Here’s how:
my_benchmark = await runloop.api.benchmarks.create(
  name="py bench",
  scenario_ids=[my_new_scenario.id, my_custom_scenario.id]
)
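Starting a run would then look roughly like the following. This is a minimal sketch that assumes the SDK exposes a start_run method on benchmarks; check the API reference for the exact method in your SDK version:
# Assumed method name; verify against the API reference.
my_run = await runloop.api.benchmarks.start_run(benchmark_id=my_benchmark.id)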
You can update both scenarios and benchmarks at any time, so you can build them up over time. You can also add or remove scenarios from a benchmark as needed.
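For example, adding the scenario created earlier to the benchmark might look like the sketch below, which assumes the SDK exposes an update method on benchmarks that takes the benchmark id and the full list of scenario ids; verify the exact signature in the API reference:
# Assumed update signature; verify against the API reference.
my_benchmark = await runloop.api.benchmarks.update(
  id=my_benchmark.id,
  scenario_ids=[my_new_scenario.id, my_custom_scenario.id, another_scenario.id]
)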