AI target scanning — MCP, Agent, RAG & Voice

The LLM red team treats a chat-completions endpoint as the asset. But an AI system is more than its chat endpoint: it has MCP servers, autonomous agents, retrieval pipelines, and voice front-ends — each with its own attack surface. Pencheff registers these as distinct target kinds, each with a dedicated scanner rather than a one-size-fits-all prompt suite.

In Register Target → AI & LLM Security you’ll see them as separate cards:

| Card | Target.kind | What it scans |
| --- | --- | --- |
| LLM Endpoint | llm | Chat endpoints (see LLM red team) |
| MCP Server | mcp | Model Context Protocol servers (HTTP SSE or stdio) |
| AI Agent | agent | An AI agent via HTTP adapter or browser UI |
| RAG / Vector DB | rag | Retrieval systems & vector databases |
| ML Model / Pipeline | ml_model | Model artifacts (static-only) |
| Voice / Speech AI | voice | STT / TTS / voice-bot / voice-auth endpoints |
| Agent Memory / Vector Store | memory | Stored memory items (see Memory scanner) |

MCP and AI Agent are separate kinds: an MCP server exposes a tool manifest you enumerate and analyze statically; an AI agent is a conversational endpoint you probe adversarially. Each has its own consent action and scanner path.

Every dynamic (live-call) technique below is consent-gated — the operator’s written authorization and the relevant opt-in flag must be present before any tool is invoked, document is written, or audio is submitted.

MCP Server (kind = "mcp")

The MCP scanner connects to the server, enumerates its tools / resources / prompts into a normalized manifest, then runs:

Static analyzers (always-on, manifest-only — no tool calls):

| Technique | Detects |
| --- | --- |
| mcp:line-jumping | Imperative/override instructions in tool descriptions (injected at tools/list, before any call) |
| mcp:hidden-content | Zero-width / bidi / Unicode-tag characters smuggled into metadata |
| mcp:excessive-agency | Dangerous capabilities (exec/shell/delete/payment) in tool surface |
| mcp:tool-shadowing | Duplicate or privileged-builtin tool names |
| mcp:tool-poisoning | Instruction-override phrasing in descriptions |
| mcp:rug-pull | Embedded/hidden-directive phrasing (mutable instruction content) |
| mcp:toxic-flow | Lethal trifecta: untrusted-input + private-data + egress tools together (CWE-441) |
| mcp:secrets-exposure | API keys / tokens / private-key headers in resource contents or schema examples |
| mcp:over-broad-scope | Wildcarded / dangerously-broad capability scopes |
| mcp:sensitive-resource | .env, id_rsa, credential-shaped resource URIs |
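To make the mcp:hidden-content row concrete, here is a minimal Python sketch of how invisible characters in a tool description could be flagged. It is an illustration only, not Pencheff's analyzer: the function names `find_hidden_chars` and `decode_tag_smuggle` and the exact code-point sets are assumptions chosen for the example.

```python
# Illustrative only: function names and code-point sets are assumptions,
# not Pencheff's implementation.
ZERO_WIDTH = {0x200B, 0x200C, 0x200D, 0x2060, 0xFEFF}
# Bidirectional embedding/override controls and isolates.
BIDI_CONTROLS = set(range(0x202A, 0x202F)) | set(range(0x2066, 0x206A))
TAG_BLOCK = range(0xE0000, 0xE0080)  # Unicode "Tags" block


def find_hidden_chars(text):
    """Return (index, code point, class) for every invisible character found."""
    hits = []
    for index, ch in enumerate(text):
        cp = ord(ch)
        if cp in ZERO_WIDTH:
            hits.append((index, f"U+{cp:04X}", "zero-width"))
        elif cp in BIDI_CONTROLS:
            hits.append((index, f"U+{cp:04X}", "bidi-control"))
        elif cp in TAG_BLOCK:
            hits.append((index, f"U+{cp:04X}", "unicode-tag"))
    return hits


def decode_tag_smuggle(text):
    """Recover ASCII text hidden in tag characters U+E0020..U+E007E."""
    return "".join(
        chr(ord(ch) - 0xE0000) for ch in text if 0xE0020 <= ord(ch) <= 0xE007E
    )
```

Decoding the tag characters back to ASCII is useful in a report, because it shows the reviewer the instruction the description was carrying invisibly.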

An exception raised by one analyzer never aborts the pass; the remaining analyzers still run. Findings carry metadata.technique = "mcp:<name>".
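The isolation behaviour described above could look roughly like the following Python sketch. The function `run_static_analyzers`, the shape of a finding (a plain dict), and the `errors` list are hypothetical; only the metadata.technique = "mcp:<name>" tagging comes from the text.

```python
def run_static_analyzers(manifest, analyzers):
    """Run every analyzer over the manifest; one failure never stops the rest.

    `analyzers` maps a technique name (e.g. "hidden-content") to a callable
    that takes the manifest and yields finding dicts. Hypothetical shape.
    """
    findings, errors = [], []
    for name, analyzer in analyzers.items():
        try:
            for finding in analyzer(manifest):
                # Tag each finding with the technique that produced it.
                finding.setdefault("metadata", {})["technique"] = f"mcp:{name}"
                findings.append(finding)
        except Exception as exc:  # isolate the failure, keep scanning
            errors.append({"analyzer": name, "error": repr(exc)})
    return findings, errors
```

Collecting the errors instead of swallowing them keeps a broken analyzer visible to the operator without losing the findings from the others.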

Dynamic techniques (gated by dynamic_invocation):

  • mcp:param-injection: command / traversal / SSRF fuzzing of tool parameters.
  • mcp:result-injection: sends a unique marker; if the marker is reflected back in a tool result, attacker-controlled text could be injected into the calling LLM.
  • mcp:destructive-no-consent: runs only when destructive_opt_in is set. Makes a benign, clearly-marked call to a destructive-classified tool to detect execution without an elicitation/consent gate. Honors the allow/deny lists and the max_calls budget.

Plus transport/auth CVE probes and a version fingerprint.
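As a rough illustration of how the gates above combine, the sketch below models a per-scan invocation policy in Python. The flag names dynamic_invocation, destructive_opt_in and max_calls come from the text; the class `InvocationPolicy`, the `allow` / `deny` field names, and the rule that an empty allow list permits every tool are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class InvocationPolicy:
    """Hypothetical gate consulted before every live tool call."""
    dynamic_invocation: bool = False
    destructive_opt_in: bool = False
    allow: set = field(default_factory=set)  # assumption: empty = allow all
    deny: set = field(default_factory=set)
    max_calls: int = 0
    calls_made: int = 0

    def may_call(self, tool_name, destructive=False):
        if not self.dynamic_invocation:
            return False  # static-only scan: never invoke tools
        if destructive and not self.destructive_opt_in:
            return False  # destructive tools need their own opt-in
        if tool_name in self.deny:
            return False
        if self.allow and tool_name not in self.allow:
            return False
        if self.calls_made >= self.max_calls:
            return False  # budget exhausted
        self.calls_made += 1
        return True
```

Checking the budget last means a refused call (denied tool, missing opt-in) does not consume one of the max_calls slots in this sketch.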

AI Agent (kind = "agent")

An agent endpoint is an LLM reached through a tool-using harness. Pick a source type:

  • agent_http — JSON request/response (with request_template / response_path for custom shapes); built-in providers include sema4_agent for the conversational create-conversation → post → poll protocol.
  • agent_browser — Playwright drives a chat UI via selectors.
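For agent_http targets with a custom shape, request_template and response_path presumably describe how a prompt goes in and where the reply comes out. The exact syntax is not documented in this section, so the Python sketch below makes two loud assumptions: the template uses a `{{prompt}}` placeholder, and response_path is a dot-separated path where numeric parts index into lists. Both helper names are hypothetical.

```python
def render_template(template, prompt, placeholder="{{prompt}}"):
    """Return a copy of `template` with the placeholder replaced in every string.

    ASSUMPTION: the "{{prompt}}" placeholder syntax is invented for this sketch.
    """
    if isinstance(template, str):
        return template.replace(placeholder, prompt)
    if isinstance(template, dict):
        return {k: render_template(v, prompt, placeholder) for k, v in template.items()}
    if isinstance(template, list):
        return [render_template(v, prompt, placeholder) for v in template]
    return template  # numbers, booleans, None pass through unchanged


def extract_by_path(payload, path):
    """Walk a dotted path such as "choices.0.message.content".

    ASSUMPTION: dotted-path syntax; the real response_path format may differ.
    """
    current = payload
    for part in path.split("."):
        current = current[int(part)] if isinstance(current, list) else current[part]
    return current
```

A scanner built this way would render the template once per attack payload, send the resulting JSON body, and pass the parsed response through `extract_by_path` to get the agent's text.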

The agent scanner runs a curated OWASP-LLM red-team pass through the shared runner — LLM01, LLM02, LLM04, LLM05, LLM06, LLM07 by default — with the mcp attack pack and jailbreak + crescendo strategies, plus lethal-trifecta (toxic-flow) static analysis over any advertised tools. Override the categories / strategies / payload budget per target:

kind_config:
  kind: agent
  source_type: agent_http
  provider: sema4_agent
  redteam:
    categories: [LLM01, LLM06]
    strategies: [jailbreak, crescendo]
    max_payloads: 36

Consent action: agent_probe (adversarial prompt probing — no data is modified on the target; only agent responses are captured).
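The lethal-trifecta (toxic-flow) check mentioned above reduces to a set question: do the advertised tools, taken together, cover untrusted input, private data, and egress? A minimal Python sketch follows. It assumes each tool has already been classified with capability tags; the tag names, the `capabilities` key, and the function name are hypothetical.

```python
# Hypothetical capability tags; how tools get classified is out of scope here.
TRIFECTA = ("untrusted_input", "private_data", "egress")


def find_toxic_flow(tools):
    """Return the tools behind each trifecta leg, or None if any leg is missing."""
    by_capability = {cap: [] for cap in TRIFECTA}
    for tool in tools:
        for cap in tool.get("capabilities", ()):
            if cap in by_capability:
                by_capability[cap].append(tool["name"])
    # The finding only fires when all three legs are present together.
    if all(by_capability.values()):
        return by_capability
    return None
```

No single tool needs to be dangerous on its own; the risk comes from the combination, which is why the check looks at the whole tool surface.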

RAG / Vector DB (kind = "rag")

Source types: managed_vdb, self_hosted_vdb, rag_endpoint, embedding_artifact.

Static analyzers (always-on): unauthenticated-DB exposure, missing tenant-isolation (cross-tenant leak), secrets-at-rest, vec2text invertibility risk, indirect-injection-at-rest (already poisoned docs in the KB), exfil-link-at-rest (markdown links with data-carrying URLs), retrieval-dominance (keyword-stuffing / PoisonedRAG relevance hijack).
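As one example of these at-rest checks, exfil-link-at-rest can be approximated by scanning stored documents for markdown links or images whose URL carries data in its query string. The Python sketch below is a heuristic illustration, not Pencheff's analyzer; the function name, the length threshold, and the "templated value" rule are assumptions.

```python
import re
from urllib.parse import urlsplit, parse_qsl

# Matches [text](http...) and ![alt](http...) and captures the URL.
MD_LINK = re.compile(r"!?\[[^\]]*\]\((https?://[^\s)]+)\)")


def find_exfil_links(doc_text, min_value_len=16):
    """Flag markdown links whose query parameters look like data carriers."""
    hits = []
    for match in MD_LINK.finditer(doc_text):
        url = match.group(1)
        query = urlsplit(url).query
        for key, value in parse_qsl(query, keep_blank_values=True):
            templated = "{" in value or "}" in value  # e.g. {{conversation}}
            if templated or len(value) >= min_value_len:
                reason = "templated" if templated else "long-value"
                hits.append({"url": url, "param": key, "reason": reason})
                break  # one hit per link is enough
    return hits
```

Image links matter most here: a chat UI that renders markdown fetches the image URL automatically, so the data leaves without the user clicking anything.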

Dynamic probes (gated by query_probes): datastore-extraction, membership-inference, cross-tenant canary.

Poison probes (also gated by poison_injection_opt_in, self-cleaning): end-to-end KB-poisoning and retrieval-hijack (a doc engineered to dominate retrieval for unrelated queries).

rag_endpoint targets additionally run a curated red-team pass (LLM01/02/04/09 default, overridable via kind_config.redteam) with the rag attack pack: poisoning, exfiltration (incl. markdown-link data-exfil), metadata-filter bypass, multi-chunk-split injection, reranker keyword-stuffing, source-attribution.

Voice / Speech AI (kind = "voice")

Source types: stt_endpoint, voice_bot, tts_endpoint, voice_auth.

Transport probes (always-on, best-effort): unauthenticated exposure, audio-URL SSRF (OAST), oversized/malformed-audio resource abuse.

Dynamic probes (gated by audio_probes):

  • voice:transcription-injection — cross-modal (audio) prompt injection, iterated over a curated payload set.
  • voice:ultrasonic-command — DolphinAttack-style inaudible commands.
  • voice:adversarial-audio — transcription instability under bounded perturbation.
  • voice:auth-spoof & voice:replay-attack (voice_auth) — synthetic speaker acceptance and byte-identical replay without nonce/liveness (CWE-294).
  • voice:ssml-injection & voice:ssml-parse-abuse (tts_endpoint) — external <audio src> SSRF and malformed-SSML resource abuse.

Note: the v1 audio transport submits a synthesized carrier, not real speech, so the cross-modal/ultrasonic/adversarial probes fire fully against text-accepting / LLM-backed voice endpoints and are dataset-ready for a real STT/TTS path; they flag fragility, not a proven spoken-word exploit.
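The voice:replay-attack idea listed above (byte-identical replay accepted without nonce/liveness) can be sketched in a few lines of Python. The `verify` callable stands in for whatever submits audio to a voice_auth endpoint and returns whether the speaker was accepted; it, the function name, and the result fields are hypothetical.

```python
import hashlib


def check_replay(verify, audio_bytes):
    """Submit the exact same audio twice and flag the target if both pass.

    `verify` is a hypothetical callable: bytes -> bool (speaker accepted?).
    """
    first = bool(verify(audio_bytes))
    second = bool(verify(audio_bytes))  # byte-identical resubmission
    return {
        "technique": "voice:replay-attack",
        "sha256": hashlib.sha256(audio_bytes).hexdigest(),
        # Accepting the same bytes twice suggests no nonce or liveness check.
        "vulnerable": first and second,
    }
```

If the first submission is rejected, the result says nothing about replay protection, so this sketch only reports a finding when both attempts are accepted.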

Agentic finding validator

Scanners cast a wide net. After every AI scan (llm / mcp / agent / rag / voice) Pencheff automatically re-checks each finding and labels it genuine vs false positive — so a model that refused the attack isn’t reported as exploited.

Standard validation re-sends the finding’s recorded attack to the live target and an LLM judge reads the actual response:

  • The target performing / disclosing the unsafe action → genuine (true_positive).
  • The target refusing, deflecting, or not doing the unsafe thing → false positive.
  • Unclear / empty → inconclusive.

The judge uses the org’s configured LLM provider; when an org hasn’t set one, it falls back to Pencheff’s platform LLM, so validation always has a judge. Verdicts (with confidence + rationale + model) are recorded under the finding’s ai_triage.validation.
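A verdict record of the kind described above might be assembled as in this Python sketch. The true_positive label and the ai_triage.validation location come from the text; the judge outcome names, the `false_positive` / `inconclusive` strings, the field names inside the record, and the function itself are assumptions.

```python
# Hypothetical mapping from what the judge observed to the stored verdict.
VERDICTS = {
    "complied": "true_positive",   # target performed / disclosed the unsafe action
    "refused": "false_positive",   # target refused, deflected, or did nothing unsafe
    "unclear": "inconclusive",     # empty or ambiguous response
}


def record_validation(finding, judge_outcome, confidence, rationale, model):
    """Attach the judge's verdict under finding["ai_triage"]["validation"]."""
    verdict = VERDICTS.get(judge_outcome, "inconclusive")
    finding.setdefault("ai_triage", {})["validation"] = {
        "verdict": verdict,
        "confidence": confidence,
        "rationale": rationale,
        "model": model,
    }
    return finding
```

Defaulting unknown outcomes to inconclusive keeps the sketch conservative: a finding is only downgraded to false positive when the judge positively saw a refusal.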

Deep Validate (manual trigger) drives the real-tools pentest agent to reproduce impact end-to-end; a working exploit returns genuine with a PoC, while no exploit is reported honestly as inconclusive (never “proof of safety”).

AI-validated false positives stay visible in the assessment, clearly labeled false positive (not hidden), and are excluded from the grade — they neither disappear nor hurt your score. DAST findings re-validate through the exploit recheck path instead of the LLM judge.
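The "visible but not graded" rule can be pictured as two views over the same finding list. The Python sketch below assumes the verdict is stored as a `verdict` string under ai_triage.validation, which is a hypothetical field name, and it deliberately does not model how the grade itself is computed.

```python
def split_for_grading(findings):
    """Keep every finding visible; drop validated false positives from grading."""

    def verdict_of(finding):
        # ASSUMPTION: verdict stored at ai_triage.validation.verdict
        return finding.get("ai_triage", {}).get("validation", {}).get("verdict")

    graded = [f for f in findings if verdict_of(f) != "false_positive"]
    return {"displayed": list(findings), "graded": graded}
```

Findings without any validation record stay in the graded set in this sketch, so an unvalidated finding is never silently excluded.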

See also

  • LLM red team — the chat-endpoint scanner and the add-on attack packs (bias / RAG / MCP / coding-agent) that also load into LLM-endpoint scans.
  • Memory scanner: kind="memory" stored-item secret + poisoning scan.
  • Custom LLM providers — configure the org judge / attacker model.