Evaluation & Optimization¶
The ADK includes dedicated libraries (GoogleAdk.Evaluation and GoogleAdk.Optimization) to systematically measure and improve the quality of your LLM prompts and agent responses.
1. Running Evaluations (LLM-as-a-Judge)¶
Instead of relying on manual testing, you can use a separate LLM agent (the "Judge") to grade the responses of your core agent against defined criteria.
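In practice, a judge model's reply arrives as free text and must be converted to a numeric score before it can be aggregated. A minimal sketch of that conversion, assuming the judge was instructed to output only a number (the `ParseJudgeScore` helper and its failing-grade fallback are illustrative, not part of the ADK):

```csharp
using System;
using System.Globalization;

// Hypothetical helper (not an ADK API): normalize a judge's raw text
// reply to a score in [0, 1], tolerating stray whitespace and clamping
// out-of-range values.
static double ParseJudgeScore(string reply)
{
    if (!double.TryParse(reply.Trim(), NumberStyles.Float, CultureInfo.InvariantCulture, out var score))
        return 0.0; // unparseable replies count as a failing grade
    return Math.Clamp(score, 0.0, 1.0);
}

Console.WriteLine(ParseJudgeScore(" 0.85 ").ToString(CultureInfo.InvariantCulture)); // 0.85
```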
Setting up the Evaluation Set¶
An EvalSet defines a collection of test cases with simulated user conversations.
using GoogleAdk.Evaluation;
using GoogleAdk.Evaluation.Models;
var evalSet = new EvalSet
{
    EvalSetId = "summarization-test",
    EvalCases =
    [
        new EvalCase
        {
            EvalId = "case_gdpr",
            Conversation =
            [
                new Invocation
                {
                    UserContent = new Content
                    {
                        Role = "user",
                        Parts = [new Part { Text = "Summarize GDPR." }]
                    }
                }
            ]
        }
    ]
};
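Because `Conversation` is a collection of `Invocation` objects, a case is not limited to a single turn: a multi-turn exchange can be simulated by listing several invocations. A sketch reusing only the types shown above (the follow-up wording is an arbitrary example):

```csharp
// Hypothetical multi-turn case, built from the same types as above.
var multiTurnCase = new EvalCase
{
    EvalId = "case_gdpr_followup",
    Conversation =
    [
        new Invocation { UserContent = new Content { Role = "user", Parts = [new Part { Text = "Summarize GDPR." }] } },
        new Invocation { UserContent = new Content { Role = "user", Parts = [new Part { Text = "Now shorten that to one sentence." }] } }
    ]
};
```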
Performing Inference¶
Run your core agent over the evaluation set to generate the responses that will later be graded.
var evalService = new LocalEvalService();
// Run the core agent against the dataset
var inferenceResults = await evalService.PerformInferenceAsync(myCoreRunner, evalSet);
Scoring with a Judge Evaluator¶
You can define a custom evaluator that uses a strict LLM judge to assign each response a score between 0.0 and 1.0.
// The LLM-as-a-Judge evaluator configuration
var judgeAgent = new LlmAgent(new LlmAgentConfig
{
    Name = "judge",
    Model = "gemini-2.5-flash",
    Instruction = "You are a strict grader. Output ONLY a number between 0 and 1."
});
var judgeRunner = new InMemoryRunner("judge", judgeAgent);
var evaluator = new LlmJudgeEvaluator(judgeRunner);
// Evaluate the original inference responses using the Judge
var scoredResults = await evalService.EvaluateAsync(evalSet, inferenceResults, [evaluator]);
foreach (var result in scoredResults)
{
    var score = result.Invocations[0].Metrics[evaluator.Name].Score;
    Console.WriteLine($"Case {result.EvalId} scored: {score:0.00}");
}
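Individual scores are often rolled up into a single metric, for example as a quality gate in CI. A minimal sketch, assuming the per-case scores from the loop above have been collected into a plain list (the sample values and the 0.75 threshold are arbitrary examples):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Assumed: per-case judge scores collected while iterating scoredResults.
var scores = new List<double> { 0.9, 0.7, 0.8 };

var mean = scores.Average();
var passed = mean >= 0.75; // arbitrary example threshold for a quality gate

Console.WriteLine($"Mean judge score: {mean:0.00} over {scores.Count} cases");
Console.WriteLine(passed ? "Quality gate: PASS" : "Quality gate: FAIL");
```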
2. Prompt Optimization¶
The SimplePromptOptimizer uses an LLM to generate variations of a system prompt, evaluates each candidate via a sampler, and returns the best-performing one.
using GoogleAdk.Optimization;
var optimizer = new SimplePromptOptimizer();
// The sampler simulates grading the effectiveness of the generated prompt candidates
var sampler = new LlmPromptSampler(judgeRunner, "Goal: Write a concise, friendly refund confirmation.");
var result = await optimizer.OptimizeAsync("Write a reply about a refund.", sampler);
Console.WriteLine("Original Prompt: Write a reply about a refund.");
Console.WriteLine($"Optimized Prompt: {result.Optimized}");
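Conceptually, this style of optimizer amounts to "generate candidates, score each, keep the best". The sketch below illustrates only that selection step, with the LLM-driven generation and grading replaced by a stub scoring function (all names here are hypothetical, not ADK APIs):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Stub "sampler" (hypothetical): rewards candidates that mention the
// goal's keywords. In the real flow this grading is done by an LLM.
double ScoreCandidate(string prompt) =>
    (prompt.Contains("concise") ? 0.5 : 0.0) + (prompt.Contains("friendly") ? 0.5 : 0.0);

var candidates = new List<string>
{
    "Write a reply about a refund.",
    "Write a concise, friendly refund confirmation.",
    "Write a very long, formal refund letter citing every policy."
};

// The optimizer's core selection step: keep the highest-scoring candidate.
var best = candidates.OrderByDescending(ScoreCandidate).First();
Console.WriteLine($"Best candidate: {best}");
```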