Skip to content

Prompt Injection Mitigation Guide

Attack Type Classification

TypeObjectiveExampleDetection Indicator
Instruction OverrideRole Destruction"Ignore previous"Role Deviation Rate
Data ExtractionSecret Information Elicitation"Show system prompt"Forbidden Token Occurrence
Context PoisoningControl Flow AlterationMalicious Embedding InjectionAnomalous Similarity Score
Chain ReversalGuard EvasionMeta-instructions / Self-referenceDeviation Chain Length

Defense-in-Depth Model

  1. Input Sanitization (control tokens / URLs / Base64)
  2. Malicious candidate filtering via embedding similarity
  3. Two-stage generation (intent summary → approval → actual generation)
  4. Output policy filter (regex + classification model)
  5. Audit logging (prompthash, decisionreason)

Validation Metrics

MetricDefinitionTarget
False Positive RateBenign blocked< 3%
False Negative RateMalicious passed< 5%
Average Latency IncreaseDefense-added delay< 300ms

Next Actions

  • Automate test corpus generation script
  • Document malicious signature update procedures

Back to: index.md