Prompt Injection Mitigation Guide¶
Attack Type Classification¶
| Type | Objective | Example | Detection Indicator |
|---|---|---|---|
| Instruction Override | Role Destruction | "Ignore previous" | Role Deviation Rate |
| Data Extraction | Secret Information Elicitation | "Show system prompt" | Forbidden Token Occurrence |
| Context Poisoning | Control Flow Alteration | Malicious Embedding Injection | Anomalous Similarity Score |
| Chain Reversal | Guard Evasion | Meta-instructions / Self-reference | Deviation Chain Length |
Defense-in-Depth Model¶
- Input Sanitization (control tokens / URLs / Base64)
- Malicious candidate filtering via embedding similarity
- Two-stage generation (intent summary → approval → actual generation)
- Output policy filter (regex + classification model)
- Audit logging (prompthash, decisionreason)
Validation Metrics¶
| Metric | Definition | Target |
|---|---|---|
| False Positive Rate | Benign blocked | < 3% |
| False Negative Rate | Malicious passed | < 5% |
| Average Latency Increase | Defense-added delay | < 300ms |
Next Actions¶
- Automate test corpus generation script
- Document malicious signature update procedures
Back to: index.md