Tutorial: LLM A/B regression gate
Use Pencheff’s --compare-to flag and the scan-comparison
endpoint to gate model upgrades on safety regressions. Same prompts,
two models, one diff.
Scenario
- Today. Production runs acme-copilot-prod (Llama 3.3 70B).
- Tomorrow. A PR proposes a swap to acme-copilot-next (Mixtral 8x22B). The eval team says it’s smarter; the safety team wants proof it’s not less safe.
- Goal. A side-by-side diff that fails the merge if any new failure techniques appear.
1. Baseline scan against the current model
pencheff llm-redteam \
--target https://copilot.acme.com/api/chat/completions \
--provider openai-chat \
--model acme-copilot-prod \
--header "Authorization=Bearer $COPILOT_TOKEN" \
--profile standard \
--strategies 'base64,jailbreak,crescendo,leetspeak' \
--datasets 'donotanswer,harmbench' \
--judge-provider openai-moderation \
--judge-endpoint https://api.openai.com/v1/moderations \
--output-format json \
--output-file baseline.json
Persist baseline.json: it is the source of truth for
“the current model passes today.” Re-run the baseline
only when the system prompt or guardrails change.
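One way to notice that the system prompt or guardrails have changed is to fingerprint the files that define them and re-run the baseline only when the fingerprint differs. The sketch below assumes those inputs live in files on disk. The helper names and the file names in the usage note (system_prompt.txt, guardrails.yaml, baseline.sha256) are hypothetical and not part of Pencheff.

```shell
#!/usr/bin/env bash
# Sketch: decide whether baseline.json needs to be regenerated.
# Assumption: everything that shapes the model's safety behavior is in files we can hash.

# Print a single SHA-256 digest covering the contents of all files passed as arguments.
baseline_fingerprint() {
  cat -- "$@" | sha256sum | cut -d' ' -f1
}

# Return 0 (stale) when the stored fingerprint is missing or no longer matches.
# Usage: baseline_is_stale <stamp-file> <input-file>...
baseline_is_stale() {
  local stamp_file=$1
  shift
  [ ! -f "$stamp_file" ] || [ "$(cat "$stamp_file")" != "$(baseline_fingerprint "$@")" ]
}
```

In CI this could read: if baseline_is_stale baseline.sha256 system_prompt.txt guardrails.yaml succeeds, re-run the baseline scan from step 1 and then write the new digest with baseline_fingerprint system_prompt.txt guardrails.yaml > baseline.sha256. Concatenating the files keeps the sketch short; it does not distinguish content moving between files.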
2. Candidate scan with --compare-to
The CLI computes the regression diff inline when you point it at
the previous run with --compare-to:
pencheff llm-redteam \
--target https://copilot.acme.com/api/chat/completions \
--provider openai-chat \
--model acme-copilot-next \
--header "Authorization=Bearer $COPILOT_TOKEN" \
--profile standard \
--strategies 'base64,jailbreak,crescendo,leetspeak' \
--datasets 'donotanswer,harmbench' \
--judge-provider openai-moderation \
--judge-endpoint https://api.openai.com/v1/moderations \
--compare-to baseline.json \
--output-format json \
--output-file candidate.json \
--fail-on high
The output JSON now carries a regression block alongside the
candidate’s findings:
{
  "regression": {
    "regressions": [
      {"owasp": "LLM01", "technique": "direct_override",
       "baseline_failures": 2, "candidate_failures": 7}
    ],
    "fixes": [
      {"owasp": "LLM05", "technique": "xss_markdown",
       "baseline_failures": 4, "candidate_failures": 0}
    ],
    "common_failures": [
      {"owasp": "LLM07", "technique": "completion_shotgun",
       "baseline_failures": 3, "candidate_failures": 3}
    ]
  }
}
3. Or compare from the API
If both scans have been uploaded to the SaaS dashboard, pull the diff from the public endpoint:
curl -s -H "Authorization: Bearer $PENCHEFF_API_KEY" \
"$PENCHEFF_API_BASE/scans/$BASELINE_SCAN_ID/compare/$CANDIDATE_SCAN_ID" \
| jq
The web UI exposes the same diff at
/scans/compare?a=$BASELINE&b=$CANDIDATE.
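In a pipeline, the API call is easier to debug if it fails loudly on an unset variable or an HTTP error instead of handing an error page to jq. The following is a minimal sketch of that hardening. It uses only the endpoint path shown above; the function names are hypothetical, and the response body is passed through untouched because its shape is not documented here.

```shell
#!/usr/bin/env bash
# Sketch: build the compare URL defensively, then fetch it with curl --fail.

# Print the compare URL. Runs in a subshell so that a missing value aborts
# only this function, not the calling script.
compare_url() (
  base=${PENCHEFF_API_BASE:?PENCHEFF_API_BASE is not set}
  baseline=${1:?baseline scan id required}
  candidate=${2:?candidate scan id required}
  # Strip one trailing slash from the base so the path never contains "//".
  printf '%s/scans/%s/compare/%s\n' "${base%/}" "$baseline" "$candidate"
)

# Fetch the diff; --fail makes curl exit non-zero on HTTP 4xx/5xx responses.
fetch_comparison() {
  local url
  url=$(compare_url "$1" "$2") || return 1
  [ -n "${PENCHEFF_API_KEY:-}" ] || { echo "PENCHEFF_API_KEY is not set" >&2; return 1; }
  curl --fail --silent --show-error \
    -H "Authorization: Bearer $PENCHEFF_API_KEY" "$url"
}
```

A call such as fetch_comparison "$BASELINE_SCAN_ID" "$CANDIDATE_SCAN_ID" > diff.json keeps curl's exit status visible to the CI step; when piping straight into jq, the pipeline reports jq's status unless pipefail is enabled.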
4. Gate the merge
The CLI’s --fail-on high already exits non-zero on any
HIGH+ candidate finding. Layer a regression check on top with
plain jq:
test "$(jq '.regression.regressions | length' candidate.json)" -eq 0 || {
  echo "::error::LLM safety regressions detected"
  exit 1
}
5. Share-by-link to the safety reviewer
curl -X POST -H "Authorization: Bearer $PENCHEFF_API_KEY" \
"$PENCHEFF_API_BASE/scans/$CANDIDATE_SCAN_ID/share?ttl_seconds=604800"
The call returns a Fernet-encrypted token that is valid for seven days
(604800 seconds). The public route GET /share/llm/{token} renders the
report as HTML, Markdown, CSV, or JSON without auth; token expiry is the only revocation.
Why share-by-link instead of an account invite? Safety reviewers are usually external and on short engagements. The token is scoped to one scan, expires automatically, and revoking is a single endpoint.
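The ttl_seconds value in the share request is plain arithmetic: 7 days × 24 hours × 60 minutes × 60 seconds = 604800. As a small sketch, the lifetime can be derived from a number of days so that the review window is readable in the CI config. The helper names are hypothetical; the URL path is the one used in the curl call above.

```shell
#!/usr/bin/env bash
# Sketch: compute ttl_seconds from whole days instead of hard-coding it.

# Convert whole days to seconds, e.g. 7 days is 604800 seconds.
days_to_seconds() {
  echo $(( $1 * 24 * 60 * 60 ))
}

# Print the share URL for a scan id ($1) and a lifetime in days ($2).
# Runs in a subshell so a missing PENCHEFF_API_BASE aborts only this function.
share_url() (
  base=${PENCHEFF_API_BASE:?PENCHEFF_API_BASE is not set}
  printf '%s/scans/%s/share?ttl_seconds=%s\n' \
    "${base%/}" "${1:?scan id required}" "$(days_to_seconds "${2:?days required}")"
)
```

The POST from this step then becomes curl -X POST -H "Authorization: Bearer $PENCHEFF_API_KEY" "$(share_url "$CANDIDATE_SCAN_ID" 7)". Since expiry is how a link stops working, choosing the shortest window that covers the review keeps exposure small.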
Deliverable
- A CI gate the platform team trusts (no regressions → merge).
- A share-link the safety reviewer can open without an account.
- The candidate scan’s compliance rollup on the LLM framework set, ready to drop into the AI risk register.
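A gate is easier to trust when the failing job names what regressed. The sketch below extends the jq check from step 4 under the assumption that candidate.json contains the regression block shown in step 2. The function names are hypothetical. It also treats a report with no regression block as an error, because the one-line check evaluates a missing block to length 0 and passes.

```shell
#!/usr/bin/env bash
# Sketch: fail the job and list each regressed technique in the CI log.
# Assumes the report has the .regression.regressions array shown in step 2.

# Print one "OWASP technique: baseline -> candidate" line per regression.
list_regressions() {
  jq -r '.regression.regressions[]?
         | "\(.owasp) \(.technique): \(.baseline_failures) -> \(.candidate_failures)"' "$1"
}

# Return 0 when there are no regressions, 1 when there are some,
# and 2 when the report is unreadable or has no regression block at all.
gate_on_regressions() {
  local report=$1 lines
  jq -e 'has("regression")' "$report" >/dev/null || return 2
  lines=$(list_regressions "$report") || return 2
  [ -n "$lines" ] || return 0
  # Prefix each line with the same ::error:: annotation used in step 4.
  printf '%s\n' "$lines" | sed 's/^/::error::LLM safety regression: /'
  return 1
}
```

Calling gate_on_regressions candidate.json || exit 1 in the workflow step would, for the sample report in step 2, log LLM01 direct_override: 2 -> 7 before failing the merge.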
Next
- Compliance overview — the AI framework deep-dives.
- Grafana dashboard — trend the failure rate over time.