This cookbook shows how to create a real ReAct-style agent eval with the TypeScript Evaris SDK. The example models a research agent used by support or operations teams during escalations. The agent needs to gather evidence from multiple sources before answering, so a ReAct runner with search and browser tools is a better fit than a single-shot generation flow. For the dataset, this recipe uses the Hugging Face ParthMandaliya/hotpot_qa distractor split. It is a strong stand-in for real research workflows because each sample usually requires multi-hop retrieval and answer synthesis rather than simple recall.
What this recipe does
- Loads a Hugging Face dataset directly from the eval definition
- Maps the dataset fields question and answer into the Evaris fields input and target
- Configures an inspect_native runner in react mode
- Gives the agent web_search and web_browser tools
- Uses the platform SDK auth flow so the client can create or reuse an agent eval suite automatically
- Scores the run with lexical and judge-based scorers
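The field mapping in the list above can be pictured as a small function. This is a minimal sketch rather than SDK code: the interfaces HotpotRecord and EvalSample and the helper toSample are hypothetical names introduced for illustration, and only the field names question, answer, input, and target come from this recipe.

```typescript
// The two dataset fields this recipe reads; other columns are ignored.
interface HotpotRecord {
  question: string;
  answer: string;
}

// Hypothetical sample shape. Only the names `input` and `target`
// come from the recipe; the interface is not an SDK type.
interface EvalSample {
  input: string;
  target: string;
}

// Map one dataset record onto the sample fields, trimming stray whitespace.
function toSample(record: HotpotRecord): EvalSample {
  return {
    input: record.question.trim(),
    target: record.answer.trim(),
  };
}

// Illustrative record, not taken from the dataset.
const sample = toSample({
  question: " Which river flows through the capital of France? ",
  answer: "Seine",
});
console.log(sample.input, "->", sample.target);
```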
Environment
The script is configured through environment variables; set them before running it.
If you omit EVARIS_WEB_SEARCH_PROVIDER, the example will skip the
web_search tool. Set it to tavily, google, or exa only when that
provider is configured in the runtime environment.
If you omit EVARIS_ENABLE_WEB_BROWSER or set it to false, the example will
skip the web_browser tool. Set it to true only when your sandbox image
includes the Inspect web browser service.
If EVARIS_SUITE_ID is omitted, the SDK creates or reuses a suite automatically when you use platformApiKeyAuth(...).
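The tool-gating rules above can be sketched as a pure function over the environment. This is an illustration under stated assumptions, not the recipe's actual code: selectTools and the Env type are hypothetical names, unrecognized provider values are treated the same as an omitted variable, and only the exact string true enables the browser.

```typescript
// Environment passed in explicitly so the function is easy to test.
// In a Node script this would typically be process.env.
type Env = Record<string, string | undefined>;

// Provider names listed in this recipe.
const SEARCH_PROVIDERS = ["tavily", "google", "exa"];

// Decide which tools the agent receives.
function selectTools(env: Env): string[] {
  const tools: string[] = [];

  // Assumption: any value outside the known providers skips web_search.
  const provider = env["EVARIS_WEB_SEARCH_PROVIDER"];
  if (provider !== undefined && SEARCH_PROVIDERS.indexOf(provider) !== -1) {
    tools.push("web_search");
  }

  // Assumption: only the literal string "true" enables web_browser.
  if (env["EVARIS_ENABLE_WEB_BROWSER"] === "true") {
    tools.push("web_browser");
  }

  return tools;
}

console.log(selectTools({ EVARIS_WEB_SEARCH_PROVIDER: "tavily" }));
```

Keeping the decision in one function makes it straightforward to log which tools were enabled before a run starts.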
TypeScript example
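As a rough sketch of how the pieces described in this recipe fit together, the eval definition can be pictured as a plain object. The type EvalSketch and its field names (dataset, fieldMap, runner, scorers) are illustrative assumptions and not the Evaris SDK's actual schema; only the values inspect_native, react, web_search, web_browser, the dataset name, and the field names question, answer, input, and target come from this recipe.

```typescript
// Illustrative shape only: these type and field names are assumptions,
// not the Evaris SDK's real schema.
interface EvalSketch {
  dataset: {
    source: "huggingface";
    name: string;
    fieldMap: Record<string, string>;
  };
  runner: { type: string; mode: string; tools: string[] };
  scorers: string[];
}

const evalSketch: EvalSketch = {
  dataset: {
    source: "huggingface",
    name: "ParthMandaliya/hotpot_qa",
    // dataset column -> Evaris sample field
    fieldMap: { question: "input", answer: "target" },
  },
  runner: {
    type: "inspect_native",
    mode: "react",
    tools: ["web_search", "web_browser"],
  },
  // Placeholder labels for the two scorer families the recipe mentions.
  scorers: ["lexical", "judge"],
};

console.log(JSON.stringify(evalSketch, null, 2));
```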
Why this is a good ReAct eval
- The task is retrieval-heavy and usually requires combining facts from multiple sources.
- The agent has to decide when to search, when to open a source, and when to stop.
- The scorer stack balances cheap lexical checks with a more realistic judge-based quality signal.
- The suite type inferred by the SDK is agent, so this lands in the agent-eval path rather than a plain model eval path.
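To make the idea of a cheap lexical check concrete, here is a minimal sketch of a normalized containment scorer. It is not the scorer that Evaris ships: normalize and lexicalScore are hypothetical helpers, and the normalization is ASCII-only, so it would need extending for answers containing non-English characters.

```typescript
// Lowercase, replace punctuation with spaces, and collapse whitespace
// so that formatting differences do not affect the comparison.
function normalize(text: string): string {
  return text
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, " ")
    .replace(/\s+/g, " ")
    .trim();
}

// Return 1 when the normalized target appears in the normalized answer
// as a whole-word sequence, otherwise 0.
function lexicalScore(answer: string, target: string): number {
  const normalizedTarget = normalize(target);
  if (normalizedTarget.length === 0) {
    return 0; // an empty target cannot be matched meaningfully
  }
  // Pad with spaces so "yes" does not match inside "yesterday".
  const haystack = " " + normalize(answer) + " ";
  const needle = " " + normalizedTarget + " ";
  return haystack.indexOf(needle) !== -1 ? 1 : 0;
}

console.log(lexicalScore("The answer is Paris.", "paris"));
```

A check like this is fast and deterministic but misses paraphrases, which is the gap a judge-based scorer is meant to cover.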
Running the example
Build the SDK first, then run the script with your usual TypeScript runner. If you want a fast smoke test, start with:
- EVARIS_HOTPOT_LIMIT=5
- a smaller agent model
- the judge scorer removed or pointed at a cheaper model
After the run completes, review:
- per-sample answers
- tool traces
- score distributions
- failure cases where the agent searched but still answered incorrectly
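The last item, failure cases where the agent searched but still answered incorrectly, can be isolated with a small filter once per-sample results are available locally. The SampleResult shape below is hypothetical and does not describe an Evaris export format; it assumes each sample carries the names of the tools it called and a binary score.

```typescript
// Hypothetical per-sample record; field names are illustrative only.
interface SampleResult {
  id: string;
  toolCalls: string[]; // names of the tools invoked, in order
  score: number; // 1 = correct, 0 = incorrect
}

// Samples where the agent called web_search at least once
// but still scored 0.
function searchedButWrong(results: SampleResult[]): SampleResult[] {
  return results.filter(
    (r) => r.score === 0 && r.toolCalls.indexOf("web_search") !== -1
  );
}

// Illustrative data, not real run output.
const demoResults: SampleResult[] = [
  { id: "s1", toolCalls: ["web_search"], score: 1 },
  { id: "s2", toolCalls: ["web_search", "web_browser"], score: 0 },
  { id: "s3", toolCalls: [], score: 0 },
];

console.log(searchedButWrong(demoResults).map((r) => r.id));
```

Separating these cases from samples where the agent never searched helps distinguish retrieval problems from synthesis problems.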