This cookbook shows how to build a realistic ReAct-style agent eval with the TypeScript Evaris SDK. The example models a research agent used by support or operations teams during escalations. The agent needs to gather evidence from multiple sources before answering, so a ReAct runner with search and browser tools is a better fit than a single-shot generation flow. For data, the recipe uses the Hugging Face ParthMandaliya/hotpot_qa dataset (distractor configuration, validation split). It is a strong stand-in for real research workflows because each sample usually requires multi-hop retrieval and answer synthesis rather than simple recall.
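To make the field mapping concrete, here is an abridged, illustrative HotpotQA-style record (real records also carry context and supporting_facts fields, which this recipe does not use); the recipe maps question to the eval input and answer to the target:

```typescript
// Illustrative HotpotQA-style sample (abridged).
// The recipe maps "question" -> input and "answer" -> target.
const sample = {
  question:
    "Which magazine was started first, Arthur's Magazine or First for Women?",
  answer: "Arthur's Magazine",
};

console.log(sample.question, "->", sample.answer);
```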

What this recipe does

  • Loads a Hugging Face dataset directly from the eval definition
  • Maps dataset fields from question and answer into Evaris input and target
  • Configures an inspect_native runner in react mode
  • Gives the agent web_search and web_browser tools
  • Uses the platform SDK auth flow so the client can create or reuse an agent eval suite automatically
  • Scores the run with lexical and judge-based scorers

Environment

Set these variables before running the script:
export EVARIS_PLATFORM_URL=http://localhost:3000
export EVARIS_RUNTIME_URL=http://127.0.0.1:8100
export EVARIS_PROJECT_ID=proj_123
export EVARIS_API_KEY=evr_pk_xxx
export EVARIS_MODEL=openrouter/openai/gpt-4o-mini
export EVARIS_JUDGE_MODEL=openrouter/openai/gpt-4o-mini
export EVARIS_HOTPOT_LIMIT=25
export EVARIS_WEB_SEARCH_PROVIDER=tavily
export EVARIS_ENABLE_WEB_BROWSER=true
Optional:
export EVARIS_SUITE_ID=es_123
  • If you omit EVARIS_WEB_SEARCH_PROVIDER, the example skips the web_search tool. Set it to tavily, google, or exa only when that provider is configured in the runtime environment.
  • If you omit EVARIS_ENABLE_WEB_BROWSER or set it to false, the example skips the web_browser tool. Set it to true only when your sandbox image includes the Inspect web browser service.
  • If you omit EVARIS_SUITE_ID, the SDK ensures a suite automatically when you use platformApiKeyAuth(...).

TypeScript example

The same example is available in the repo at examples/ts/react-hotpotqa-eval.mts.
import { eval_, runtime } from "@evaris/sdk";

function requireEnv(name: string): string {
  const value = process.env[name]?.trim();
  if (!value) {
    throw new Error(`Missing required env var ${name}`);
  }
  return value;
}

function readPositiveInt(name: string, fallback: number): number {
  const raw = process.env[name]?.trim();
  if (!raw) {
    return fallback;
  }

  const value = Number.parseInt(raw, 10);
  if (!Number.isFinite(value) || value <= 0) {
    throw new Error(`${name} must be a positive integer`);
  }
  return value;
}

async function main(): Promise<void> {
  // `||` (not `??`) so empty-string env vars also fall back to the defaults.
  const runtimeBaseUrl =
    process.env.EVARIS_RUNTIME_URL?.trim() || "http://127.0.0.1:8100";
  const platformBaseUrl =
    process.env.EVARIS_PLATFORM_URL?.trim() || "http://localhost:3000";
  const projectId = requireEnv("EVARIS_PROJECT_ID");
  const apiKey = requireEnv("EVARIS_API_KEY");
  const suiteId = process.env.EVARIS_SUITE_ID?.trim() || undefined;
  const agentModel =
    process.env.EVARIS_MODEL?.trim() || "openrouter/openai/gpt-4o-mini";
  const judgeModel =
    process.env.EVARIS_JUDGE_MODEL?.trim() || "openrouter/openai/gpt-4o-mini";
  const limit = readPositiveInt("EVARIS_HOTPOT_LIMIT", 25);
  const webSearchProvider = process.env.EVARIS_WEB_SEARCH_PROVIDER?.trim();
  const enableWebBrowser =
    (process.env.EVARIS_ENABLE_WEB_BROWSER?.trim().toLowerCase() ?? "") ===
    "true";
  const agentTools = [
    ...(webSearchProvider
      ? [eval_.tools.webSearch({ provider: webSearchProvider })]
      : []),
    ...(enableWebBrowser
      ? [eval_.tools.webBrowser({ interactive: false })]
      : []),
  ];

  const client = new runtime.Client({
    baseUrl: runtimeBaseUrl,
    auth: runtime.platformApiKeyAuth({
      platformBaseUrl,
      projectId,
      apiKey,
    }),
  });

  const evalDefinition = eval_.define({
    suite_id: suiteId,
    id: "react-support-research-hotpotqa",
    description:
      "ReAct research agent eval for support and operations escalations using HotpotQA.",
    data: eval_.datasets.huggingface("ParthMandaliya/hotpot_qa", {
      name: "distractor",
      split: "validation",
      limit,
      shuffle: true,
      seed: 17,
      sample_fields: {
        input: "question",
        target: "answer",
      },
    }),
    run: eval_.agentRunner({
      type: "inspect_native",
      mode: "react",
      model: {
        name: agentModel,
      },
      steps: [
        eval_.steps.systemMessage(
          [
            "You are a support escalation research agent.",
            "Investigate the question carefully before answering.",
            "Use tools when you need evidence.",
            "Return a concise final answer with no extra commentary.",
          ].join("\n"),
        ),
        eval_.steps.userMessage(
          [
            "Customer escalation question:",
            "{input}",
            "",
            "Work like a ReAct agent: search, inspect sources, then answer.",
          ].join("\n"),
        ),
        eval_.steps.agent(
          agentTools,
          {
            prompt:
              "Prefer a short search loop. Verify the answer before you submit it.",
            messageLimit: 12,
            maxAttempts: 2,
            truncation: "auto",
          },
        ),
      ],
      config: {
        use_case: "support-escalation-research",
      },
    }),
    score: [
      eval_.scorers.f1(),
      eval_.scorers.match({
        location: "any",
        ignoreCase: true,
        ignoreWhitespace: true,
        ignorePunctuation: true,
      }),
      eval_.scorers.modelGradedQa({
        model: judgeModel,
        partialCredit: true,
        instructions: [
          "Grade the answer on factual correctness.",
          "Reward answers that resolve the question directly.",
          "Penalize unsupported claims and missing key entities.",
        ].join(" "),
      }),
    ],
    channel: "sdk",
    labels: {
      cookbook: "react-agent",
      dataset: "hotpotqa",
      use_case: "support-research",
    },
    params: {
      cookbook: {
        scenario: "react-support-research",
        dataset: "ParthMandaliya/hotpot_qa",
        split: "validation",
      },
    },
  });

  const submitted = await client.submitEval(evalDefinition);
  const run = await client.waitForRun(submitted.job_id, {
    pollIntervalMs: 5_000,
    timeoutMs: 30 * 60 * 1_000,
  });

  console.log(
    JSON.stringify(
      {
        job_id: submitted.job_id,
        suite_id: run.suite_id,
        status: run.status,
      },
      null,
      2,
    ),
  );
}

main().catch((error: unknown) => {
  console.error(error);
  process.exitCode = 1;
});

Why this is a good ReAct eval

  • The task is retrieval-heavy and usually requires combining facts from multiple sources.
  • The agent has to decide when to search, when to open a source, and when to stop.
  • The scorer stack balances cheap lexical checks with a more realistic judge-based quality signal.
  • The suite type inferred by the SDK is agent, so this lands in the agent-eval path rather than a plain model eval path.
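The lexical side of that scorer stack can be pictured as token-overlap F1. A simplified sketch follows; the SDK's actual f1 scorer and its normalization rules may differ, so treat this as an illustration of the metric, not the implementation:

```typescript
// Simplified token-level F1: lowercase, strip punctuation, split on
// whitespace, then score the multiset overlap between prediction and target.
function tokenF1(prediction: string, target: string): number {
  const norm = (s: string) =>
    s.toLowerCase().replace(/[^\w\s]/g, " ").split(/\s+/).filter(Boolean);
  const pred = norm(prediction);
  const gold = norm(target);
  if (pred.length === 0 || gold.length === 0) {
    return pred.length === gold.length ? 1 : 0;
  }
  // Count overlapping tokens (multiset intersection).
  const counts = new Map<string, number>();
  for (const t of gold) counts.set(t, (counts.get(t) ?? 0) + 1);
  let overlap = 0;
  for (const t of pred) {
    const c = counts.get(t) ?? 0;
    if (c > 0) {
      overlap += 1;
      counts.set(t, c - 1);
    }
  }
  if (overlap === 0) return 0;
  const precision = overlap / pred.length;
  const recall = overlap / gold.length;
  return (2 * precision * recall) / (precision + recall);
}
```

A verbose answer that contains the right entity still earns partial credit here, which is why the recipe pairs this cheap check with a judge model for a quality signal.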

Running the example

Build the SDK first, then run the script with your usual TypeScript runner. If you want a fast smoke test, start with:
  • EVARIS_HOTPOT_LIMIT=5
  • a smaller agent model
  • the judge scorer removed or pointed at a cheaper model
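With those smoke-test settings, a run might look like the following. Both the build script name and the tsx runner are assumptions; substitute your repo's actual build command and whichever TypeScript runner you use:

```shell
# Build the SDK packages first (assumed npm script name).
npm run build

# Smoke test: small sample count, run via tsx (any TypeScript runner works).
EVARIS_HOTPOT_LIMIT=5 npx tsx examples/ts/react-hotpotqa-eval.mts
```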
After the run completes, inspect the run in Evaris to review:
  • per-sample answers
  • tool traces
  • score distributions
  • failure cases where the agent searched but still answered incorrectly