RubricBasedEvaluator

The RubricBasedEvaluator is a built-in evaluation tool in the GoogleAdk.Evaluation library. It uses a large language model (an "LLM Judge") to evaluate an agent's response against a specific, natural-language grading rubric.

Overview

When building AI agents, qualities such as clarity, tone, or helpfulness are difficult to measure deterministically. The RubricBasedEvaluator addresses this by prompting an LLM judge with the user's input, the expected output (if available), the agent's actual output, and your specific instructions on how to score the interaction.

Usage

You configure the evaluator by providing a model for the judge, a metric name, and a string that contains the rubric.

using GoogleAdk.Core.Abstractions.Models;
using GoogleAdk.Evaluation.Evaluators;

// The model that acts as the LLM judge.
var judgeModel = new LlmModel("gemini-2.5-flash");

// The rubric is plain natural language; the judge scores each response against it.
var rubricEvaluator = new RubricBasedEvaluator(
    name: "HelpfulnessScore",
    judgeModel: judgeModel,
    rubricDescription: """
        Evaluate how helpful the ACTUAL OUTPUT is for the USER INPUT.
        - Score 1.0: Extremely helpful, directly answers the prompt, friendly tone.
        - Score 0.5: Somewhat helpful but misses details or has a flat tone.
        - Score 0.0: Unhelpful, irrelevant, or incorrect.
        """
);

Applying the Evaluator

Pass the configured evaluator into your LocalEvalService.EvaluateAsync method. The evaluator will automatically be applied to every EvalCase in the set.

var scoredResults = await evalService.EvaluateAsync(
    evalSet, 
    inferenceResults, 
    [rubricEvaluator]);

var firstCaseMetrics = scoredResults[0].Invocations[0].Metrics;
var helpfulnessResult = firstCaseMetrics["HelpfulnessScore"];

Console.WriteLine($"Score: {helpfulnessResult.Score}");
Console.WriteLine($"Reason: {helpfulnessResult.Reason}");

How It Works

Under the hood, RubricBasedEvaluator extends LlmAsJudgeEvaluator. It formulates a prompt containing:

1. Your Rubric
2. The User Input
3. The Expected Output (from your EvalCase.FinalResponse)
4. The Actual Output (the response generated by your agent during inference)
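
To make that concrete, here is a rough sketch of the kind of prompt the judge receives. The method name, section headers, and wording below are illustrative assumptions; the actual template is internal to the library.

// Illustration only — not the library's actual prompt template.
static string BuildJudgePrompt(
    string rubric, string userInput, string? expectedOutput, string actualOutput) =>
    $"""
    You are an impartial judge. Score the ACTUAL OUTPUT against the rubric.

    RUBRIC:
    {rubric}

    USER INPUT:
    {userInput}

    EXPECTED OUTPUT:
    {expectedOutput ?? "(none provided)"}

    ACTUAL OUTPUT:
    {actualOutput}

    Respond with a JSON object containing a "score" (float) and a "reason" (string).
    """;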

By default, the evaluator forces the judge model to output JSON with a "score" (float) and a "reason" (string), ensuring you get a reliable, parsable metric back.
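
Because the reply is plain JSON, its shape is easy to inspect. The snippet below is purely illustrative: the library performs this parsing internally and exposes the result through the metric's Score and Reason properties (as shown earlier), and the sample values here are made up.

using System.Text.Json;

// Illustrative judge reply matching the contract described above.
var raw = """{ "score": 0.5, "reason": "Answers the question but the tone is flat." }""";

using var doc = JsonDocument.Parse(raw);
var score = doc.RootElement.GetProperty("score").GetSingle();
var reason = doc.RootElement.GetProperty("reason").GetString();

Console.WriteLine($"Parsed verdict -> score: {score}, reason: {reason}");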