Documentation Index

Fetch the complete documentation index at: https://whyops.com/docs/llms.txt

Use this file to discover all available pages before exploring further.

WhyOps doesn’t just record what your agents did—it tells you why they failed and how to improve them. This is done through the Agent Analyses feature in the whyops-analyse service.

Static Analysis

Every trace in WhyOps is subjected to Static Analysis. This runs automatically and detects common structural failures without needing an LLM judge.

What it detects:

  • Missing Parent Steps: An event references a parent step that does not exist in the trace.
  • Orphan Tool Call Responses: A tool returned a result, but the request was never logged.
  • Missing Tool Call Responses: The agent requested a tool, but it never received a response (timeout or crash).
  • Identical Tool Call Loops: The agent called the exact same tool with the exact same arguments three or more times in a row (a classic infinite loop).
  • Consecutive Error Streaks: The agent encountered 3 or more errors in a row without recovering.
  • Latency Outliers: A step took significantly longer (e.g., > 1.5x the p95 latency) than the rest of the trace.
  • Token Outliers: A step consumed significantly more tokens than the rest of the trace (e.g., above the trace’s p95 token usage).

Static analysis provides actionable recommendations for each finding, such as “Add a circuit breaker to avoid repeated failing attempts.”
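To illustrate how a structural check like this works, here is a minimal sketch of identical-tool-call-loop detection. This is not the actual whyops-analyse implementation; the event shape (dicts with `type`, `tool`, and `args` keys) and the `threshold` parameter are assumptions for the example.

```python
from itertools import groupby

def find_identical_tool_call_loops(events, threshold=3):
    """Flag runs of >= `threshold` consecutive tool calls that use the
    same tool name and the same arguments (a classic infinite loop).

    `events` is an assumed trace format: a list of dicts such as
    {"type": "tool_call", "tool": "search", "args": {...}}.
    """
    calls = [e for e in events if e["type"] == "tool_call"]
    findings = []
    # groupby only merges *consecutive* items, which is exactly the
    # "3 or more times consecutively" rule described above.
    for (tool, _args), group in groupby(
        calls, key=lambda e: (e["tool"], str(e["args"]))
    ):
        run = list(group)
        if len(run) >= threshold:
            findings.append({
                "tool": tool,
                "count": len(run),
                "recommendation": "Add a circuit breaker to avoid "
                                  "repeated failing attempts.",
            })
    return findings
```

The same consecutive-grouping approach extends naturally to the error-streak check: group events by whether they are errors and flag any error run of three or more.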

LLM-as-a-Judge Analysis

For deeper insights, WhyOps supports Agent Analysis Runs using an LLM Judge. You can configure WhyOps to evaluate traces on specific dimensions.

Supported Dimensions

  1. intent_precision: Did the agent correctly identify the user’s intent?
  2. followup_repair: How well did the agent handle clarifying questions or recover from ambiguity?
  3. answer_completeness_clarity: Was the final answer complete, accurate, and easy to understand?
  4. tool_routing_quality: Did the agent select the right tool for the job?
  5. tool_invocation_quality: Did the agent pass the correct arguments to the tool?
  6. tool_output_utilization: Did the agent correctly interpret and use the output of the tool?
  7. reliability_recovery: How gracefully did the agent handle errors or missing information?
  8. latency_cost_efficiency: Was the agent’s path to the answer efficient?
  9. conversation_ux: Was the tone and structure of the conversation appropriate?
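As a rough illustration of how a single-dimension judge evaluation could be structured, here is a hedged sketch. The prompt wording, the 1-5 scale, the JSON reply format, and the helper names (`build_judge_prompt`, `parse_judge_reply`) are all assumptions for the example, not the prompts WhyOps actually sends to the judge model.

```python
import json

# Two of the supported dimensions, with their rubric questions.
DIMENSION_RUBRICS = {
    "tool_routing_quality":
        "Did the agent select the right tool for the job?",
    "reliability_recovery":
        "How gracefully did the agent handle errors or missing information?",
}

def build_judge_prompt(dimension, transcript):
    """Assemble a single-dimension judging prompt. Asking for a JSON
    reply keeps the verdict machine-parseable."""
    rubric = DIMENSION_RUBRICS[dimension]
    return (
        f"You are evaluating an AI agent on '{dimension}': {rubric}\n\n"
        f"Transcript:\n{transcript}\n\n"
        'Reply with JSON: {"score": <1-5>, "rationale": "<one sentence>"}'
    )

def parse_judge_reply(reply):
    """Parse the judge model's JSON reply and sanity-check the score."""
    verdict = json.loads(reply)
    if not 1 <= verdict["score"] <= 5:
        raise ValueError(f"score out of range: {verdict['score']}")
    return verdict
```

In practice the prompt would be sent to the configured `judgeModel` and the parsed scores aggregated across traces.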

Configuring Agent Analyses

You can configure analyses to run automatically on a cron schedule, or trigger them manually from the dashboard or via the API:

POST /api/agent-analyses/:agentId/run

{
  "lookbackDays": 7,
  "mode": "standard", // "quick", "standard", or "deep"
  "judgeModel": "gpt-4o",
  "dimensions": ["tool_routing_quality", "reliability_recovery"]
}

Each analysis run scores the agent’s traces and generates an Agent Knowledge Profile that summarizes your agent’s strengths and weaknesses.