# Scorers

A scorer is the evaluation function that determines whether an AI output meets your quality criteria. LaunchGate supports five scorer types.
## Choosing a scorer
| You want to check… | Use |
|---|---|
| Output matches an expected string exactly | exact_match |
| Output contains (or excludes) specific substrings | contains |
| Output matches a regex pattern (dates, IDs, free text) | regex |
| Output is valid JSON with a required structure, field types, numeric ranges, or array lengths | json_schema |
| Semantic judgement — faithfulness, relevance, safety, tone | llm_judge |
Prefer deterministic scorers (exact_match, contains, regex, json_schema) over llm_judge where possible — they are free, instant, and reproducible. Reach for llm_judge only when the check genuinely requires language understanding.
Do not use regex to validate JSON structure. Regex against stringified JSON breaks on key reordering, whitespace, and nested objects. Use json_schema instead — it is declarative, robust, and lets you express required fields, numeric ranges (minimum/maximum), and array lengths (minItems/maxItems) natively.
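To see why, here is a minimal Python sketch (illustrative code, not LaunchGate's API): a regex pinned to one serialization of a JSON object breaks as soon as keys are reordered, while parsing the JSON and checking its structure does not care about ordering.

```python
import json
import re

# Two semantically identical JSON outputs:
a = '{"answer": "yes", "confidence": 0.9}'
b = '{"confidence": 0.9, "answer": "yes"}'  # same data, keys reordered

# A regex pinned to one key order matches only the first form.
pattern = re.compile(r'\{"answer":\s*".*",\s*"confidence":')
assert pattern.search(a) is not None
assert pattern.search(b) is None  # brittle: fails on reordering

# Parsing the JSON and checking structure is order-independent.
for text in (a, b):
    doc = json.loads(text)
    assert {"answer", "confidence"} <= doc.keys()
    assert 0 <= doc["confidence"] <= 1
```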
## Scorer types
### exact_match
Compares the output string to an expected value.
```json
{
  "type": "exact_match",
  "config": {
    "case_sensitive": true
  }
}
```

| Config | Type | Default | Description |
|---|---|---|---|
| case_sensitive | boolean | true | Whether comparison is case-sensitive |
Score: 1.0 if match, 0.0 if not.
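The scoring rule can be sketched in Python (an illustration of the semantics described here, not LaunchGate's implementation):

```python
def exact_match_score(output: str, expected: str, case_sensitive: bool = True) -> float:
    """Sketch of exact_match: 1.0 on an exact string match, else 0.0."""
    if not case_sensitive:
        output, expected = output.lower(), expected.lower()
    return 1.0 if output == expected else 0.0
```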
### regex
Tests the output against a regular expression pattern.
```json
{
  "type": "regex",
  "config": {
    "pattern": "\\d{4}-\\d{2}-\\d{2}",
    "flags": "i",
    "should_match": true
  }
}
```

| Config | Type | Default | Description |
|---|---|---|---|
| pattern | string | required | Regular expression pattern |
| flags | string | "" | Regex flags (e.g., "i" for case-insensitive) |
| should_match | boolean | true | Set to false to assert the pattern does NOT match |
Score: 1.0 if pattern matches (or doesn’t match when should_match: false), 0.0 otherwise.
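A minimal sketch of these semantics in Python (the flag mapping is an assumption about which letters are supported; it is not LaunchGate's code):

```python
import re

# Assumed flag letters; the real scorer may support a different set.
_FLAG_MAP = {"i": re.IGNORECASE, "m": re.MULTILINE, "s": re.DOTALL}

def regex_score(output: str, pattern: str, flags: str = "",
                should_match: bool = True) -> float:
    """1.0 when the match result agrees with should_match, else 0.0."""
    compiled_flags = 0
    for ch in flags:
        compiled_flags |= _FLAG_MAP.get(ch, 0)
    matched = re.search(pattern, output, compiled_flags) is not None
    return 1.0 if matched == should_match else 0.0
```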
### json_schema
Validates that the output is valid JSON conforming to a JSON Schema.
```json
{
  "type": "json_schema",
  "config": {
    "schema": {
      "type": "object",
      "required": ["answer", "confidence"],
      "properties": {
        "answer": { "type": "string" },
        "confidence": { "type": "number", "minimum": 0, "maximum": 1 }
      }
    }
  }
}
```

| Config | Type | Description |
|---|---|---|
| schema | object | A valid JSON Schema definition |
Score: 1.0 if valid, 0.0 if invalid. The reason includes validation errors.
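To make the checks concrete, here is a hand-rolled Python sketch of what validation against the example schema above entails (required fields, field types, numeric range). A real scorer would use a full JSON Schema validator; this is only an illustration.

```python
import json

def score_against_example_schema(output: str) -> float:
    """Mirror the example schema by hand: required fields, types, confidence range."""
    try:
        doc = json.loads(output)
    except json.JSONDecodeError:
        return 0.0  # not valid JSON at all
    if not isinstance(doc, dict):
        return 0.0
    if not isinstance(doc.get("answer"), str):
        return 0.0  # "answer" missing or wrong type
    conf = doc.get("confidence")
    # bool is excluded because it is a subclass of int in Python
    if not isinstance(conf, (int, float)) or isinstance(conf, bool):
        return 0.0
    if not 0 <= conf <= 1:
        return 0.0  # violates minimum/maximum
    return 1.0
```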
### contains
Checks whether the output contains (or doesn’t contain) specific substrings.
```json
{
  "type": "contains",
  "config": {
    "values": ["source:", "reference"],
    "mode": "any"
  }
}
```

| Config | Type | Default | Description |
|---|---|---|---|
| values | string[] | required | Substrings to check for |
| mode | string | "all" | "all" — must contain every value; "any" — at least one; "none" — must contain none |
Score: 1.0 if condition met, 0.0 otherwise.
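The three modes can be sketched in Python (illustrative only, not LaunchGate's implementation):

```python
def contains_score(output: str, values: list[str], mode: str = "all") -> float:
    """1.0 if the substring condition for the given mode holds, else 0.0."""
    hits = [v in output for v in values]
    if mode == "all":
        ok = all(hits)        # every value present
    elif mode == "any":
        ok = any(hits)        # at least one value present
    elif mode == "none":
        ok = not any(hits)    # no value present
    else:
        raise ValueError(f"unknown mode: {mode}")
    return 1.0 if ok else 0.0
```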
### llm_judge
Uses an external LLM to evaluate the output against a rubric you define.
```json
{
  "type": "llm_judge",
  "config": {
    "rubric": "Rate how faithfully the answer reflects the provided context. Score 1.0 if fully faithful, 0.0 if it contains hallucinated information.",
    "model": "gpt-4o",
    "_provider": "openai"
  }
}
```

| Config | Type | Default | Description |
|---|---|---|---|
| rubric | string | required | Evaluation criteria for the LLM |
| model | string | varies | Model to use (e.g., gpt-4o, claude-sonnet-4-20250514) |
| scale | [number, number] | [0, 1] | Score range |
| _provider | string | auto | Provider: openai, anthropic, google, azure_openai |
Score: A value between 0 and 1 as determined by the LLM judge.
LLM judge scorers require a BYOK key for the corresponding provider. Without one, the case will fail with a configuration error.
**Cost and latency multiplier.** Each llm_judge case in a suite adds one LLM API call per run, billed against your BYOK key. A suite with four judge cases run 100 times per day produces 400 extra LLM calls per day on top of your own application traffic. Budget accordingly and prefer deterministic scorers where the check allows.
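The arithmetic is worth making explicit when budgeting (a trivial helper, not part of LaunchGate):

```python
def extra_judge_calls_per_day(judge_cases: int, runs_per_day: int) -> int:
    # Each llm_judge case in the suite adds one LLM API call per suite run.
    return judge_cases * runs_per_day

calls = extra_judge_calls_per_day(4, 100)  # 400, as in the example above
```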
## Dual-sided scorers
Scorers can be configured as dual-sided, meaning they evaluate both precision and recall independently:
```json
{
  "dual_sided": true,
  "precision_threshold": 0.8,
  "recall_threshold": 0.7
}
```

This is useful for cases where you want to measure both the accuracy and completeness of an output separately.
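A sketch of the assumed pass condition, given the config above (the exact pass/fail semantics are an assumption — the config names are taken from the example, the logic is illustrative):

```python
def dual_sided_pass(precision: float, recall: float,
                    precision_threshold: float = 0.8,
                    recall_threshold: float = 0.7) -> bool:
    """Assumed semantics: a dual-sided case passes only when both
    independent scores clear their respective thresholds."""
    return precision >= precision_threshold and recall >= recall_threshold
```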
## Scorer scope
Scorers are created at the project level and can be reused across multiple eval cases within that project.