# Scorers

A scorer is the evaluation function that determines whether an AI output meets your quality criteria. LaunchGate supports five scorer types.
## Choosing a scorer
| You want to check… | Use |
|---|---|
| Output matches an expected string exactly | exact_match |
| Output contains (or excludes) specific substrings | contains |
| Output matches a regex pattern (dates, IDs, free text) | regex |
| Output is valid JSON with a required structure, field types, numeric ranges, or array lengths | json_schema |
| Semantic judgement — faithfulness, relevance, safety, tone | llm_judge |
Prefer deterministic scorers (exact_match, contains, regex, json_schema) over llm_judge where possible — they are free, instant, and reproducible. Reach for llm_judge only when the check genuinely requires language understanding.
Do not use regex to validate JSON structure. Regex against stringified JSON breaks on key reordering, whitespace, and nested objects. Use json_schema instead — it is declarative, robust, and lets you express required fields, numeric ranges (minimum/maximum), and array lengths (minItems/maxItems) natively.
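To see why, here is a minimal Python sketch (illustrative code, not LaunchGate's API): a regex pinned to one serialization of a JSON object breaks as soon as keys are reordered, while parsing the JSON and checking its structure does not care about ordering.

```python
import json
import re

# Two semantically identical JSON outputs:
a = '{"answer": "yes", "confidence": 0.9}'
b = '{"confidence": 0.9, "answer": "yes"}'  # same data, keys reordered

# A regex pinned to one key order matches only the first form.
pattern = re.compile(r'\{"answer":\s*".*",\s*"confidence":')
assert pattern.search(a) is not None
assert pattern.search(b) is None  # brittle: fails on reordering

# Parsing the JSON and checking structure is order-independent.
for text in (a, b):
    doc = json.loads(text)
    assert {"answer", "confidence"} <= doc.keys()
    assert 0 <= doc["confidence"] <= 1
```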
## Scorer types
### exact_match
Compares the output string to an expected value.
```json
{
  "type": "exact_match",
  "config": {
    "case_sensitive": true
  }
}
```

| Config | Type | Default | Description |
|---|---|---|---|
| case_sensitive | boolean | true | Whether comparison is case-sensitive |
Score: 1.0 if match, 0.0 if not.
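The scoring rule can be sketched in Python (an illustration of the semantics described here, not LaunchGate's implementation):

```python
def exact_match_score(output: str, expected: str, case_sensitive: bool = True) -> float:
    """Sketch of exact_match: 1.0 on an exact string match, else 0.0."""
    if not case_sensitive:
        output, expected = output.lower(), expected.lower()
    return 1.0 if output == expected else 0.0
```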
### regex
Tests the output against a regular expression pattern.
```json
{
  "type": "regex",
  "config": {
    "pattern": "\\d{4}-\\d{2}-\\d{2}",
    "flags": "i",
    "should_match": true
  }
}
```

| Config | Type | Default | Description |
|---|---|---|---|
| pattern | string | required | Regular expression pattern |
| flags | string | "" | Regex flags (e.g., "i" for case-insensitive) |
| should_match | boolean | true | Set to false to assert the pattern does NOT match |
Score: 1.0 if pattern matches (or doesn’t match when should_match: false), 0.0 otherwise.
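A minimal sketch of these semantics in Python (the flag mapping is an assumption about which letters are supported; it is not LaunchGate's code):

```python
import re

# Assumed flag letters; the real scorer may support a different set.
_FLAG_MAP = {"i": re.IGNORECASE, "m": re.MULTILINE, "s": re.DOTALL}

def regex_score(output: str, pattern: str, flags: str = "",
                should_match: bool = True) -> float:
    """1.0 when the match result agrees with should_match, else 0.0."""
    compiled_flags = 0
    for ch in flags:
        compiled_flags |= _FLAG_MAP.get(ch, 0)
    matched = re.search(pattern, output, compiled_flags) is not None
    return 1.0 if matched == should_match else 0.0
```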
### json_schema
Validates that the output is valid JSON conforming to a JSON Schema.
```json
{
  "type": "json_schema",
  "config": {
    "schema": {
      "type": "object",
      "required": ["answer", "confidence"],
      "properties": {
        "answer": { "type": "string" },
        "confidence": { "type": "number", "minimum": 0, "maximum": 1 }
      }
    }
  }
}
```

| Config | Type | Description |
|---|---|---|
| schema | object | A valid JSON Schema definition |
Score: 1.0 if valid, 0.0 if invalid. The reason includes validation errors.
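To make the checks concrete, here is a hand-rolled Python sketch of what validation against the example schema above entails (required fields, field types, numeric range). A real scorer would use a full JSON Schema validator; this is only an illustration.

```python
import json

def score_against_example_schema(output: str) -> float:
    """Mirror the example schema by hand: required fields, types, confidence range."""
    try:
        doc = json.loads(output)
    except json.JSONDecodeError:
        return 0.0  # not valid JSON at all
    if not isinstance(doc, dict):
        return 0.0
    if not isinstance(doc.get("answer"), str):
        return 0.0  # "answer" missing or wrong type
    conf = doc.get("confidence")
    # bool is excluded because it is a subclass of int in Python
    if not isinstance(conf, (int, float)) or isinstance(conf, bool):
        return 0.0
    if not 0 <= conf <= 1:
        return 0.0  # violates minimum/maximum
    return 1.0
```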
### contains
Checks whether the output contains (or doesn’t contain) specific substrings.
```json
{
  "type": "contains",
  "config": {
    "values": ["source:", "reference"],
    "mode": "any"
  }
}
```

| Config | Type | Default | Description |
|---|---|---|---|
| values | string[] | required | Substrings to check for |
| mode | string | "all" | "all" — must contain every value; "any" — at least one; "none" — must contain none |
Score: 1.0 if condition met, 0.0 otherwise.
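The three modes can be sketched in Python (illustrative only, not LaunchGate's implementation):

```python
def contains_score(output: str, values: list[str], mode: str = "all") -> float:
    """1.0 if the substring condition for the given mode holds, else 0.0."""
    hits = [v in output for v in values]
    if mode == "all":
        ok = all(hits)        # every value present
    elif mode == "any":
        ok = any(hits)        # at least one value present
    elif mode == "none":
        ok = not any(hits)    # no value present
    else:
        raise ValueError(f"unknown mode: {mode}")
    return 1.0 if ok else 0.0
```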
### llm_judge
Uses an external LLM to evaluate the output against a rubric you define.
```json
{
  "type": "llm_judge",
  "config": {
    "rubric": "Rate how faithfully the answer reflects the provided context. Score 1.0 if fully faithful, 0.0 if it contains hallucinated information.",
    "model": "gpt-4o",
    "_provider": "openai"
  }
}
```

| Config | Type | Default | Description |
|---|---|---|---|
| rubric | string | required | Evaluation criteria for the LLM |
| model | string | varies | Model to use (e.g., gpt-4o, claude-sonnet-4-20250514) |
| scale | [number, number] | [0, 1] | Score range |
| _provider | string | auto | Provider: openai, anthropic, google, azure_openai |
Score: A value between 0 and 1 as determined by the LLM judge.
LLM judge scorers require a BYOK key for the corresponding provider. Without one, the case will fail with a configuration error.
**Cost and latency multiplier.** Each llm_judge case in a suite adds one LLM API call per run, billed against your BYOK key. A suite with four judge cases run 100 times per day produces 400 extra LLM calls per day on top of your own application traffic. Budget accordingly and prefer deterministic scorers where the check allows.
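The arithmetic is worth making explicit when budgeting (a trivial helper, not part of LaunchGate):

```python
def extra_judge_calls_per_day(judge_cases: int, runs_per_day: int) -> int:
    # Each llm_judge case in the suite adds one LLM API call per suite run.
    return judge_cases * runs_per_day

calls = extra_judge_calls_per_day(4, 100)  # 400, as in the example above
```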
## Dual-sided scorers
Scorers can be configured as dual-sided, meaning they evaluate both precision and recall independently:
```json
{
  "dual_sided": true,
  "precision_threshold": 0.8,
  "recall_threshold": 0.7
}
```

This is useful for cases where you want to measure both the accuracy and completeness of an output separately.
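A sketch of the assumed pass condition, given the config above (the exact pass/fail semantics are an assumption — the config names are taken from the example, the logic is illustrative):

```python
def dual_sided_pass(precision: float, recall: float,
                    precision_threshold: float = 0.8,
                    recall_threshold: float = 0.7) -> bool:
    """Assumed semantics: a dual-sided case passes only when both
    independent scores clear their respective thresholds."""
    return precision >= precision_threshold and recall >= recall_threshold
```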
## Scorer scope
Scorers are created at the project level and can be reused across multiple eval cases within that project.