
Input Data Masking / Sanitization Practical Guide

This guide collects implementation patterns for reducing unnecessary exposure of personal information, secrets, and source code fragments before they reach AI agents or LLMs, and for blocking misuse, re-output, and leak pathways.

🎯 Objectives

| Objective | Description | Success Metrics |
| --- | --- | --- |
| Minimal Transmission | Send only inference-essential fields | Average transmitted fields ▼ / Redaction rate ▲ |
| De-identification | Tokenization/hashing to prevent re-identification | Re-identification test success rate < 1% |
| Dynamic Masking | Differential masking based on context/permissions | Zero unauthorized displays |
| Output Re-exposure Prevention | Prohibit re-output of masked fields | Zero re-exposure detections |
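
The "Minimal Transmission" objective can be illustrated with an allowlist filter. This is a minimal sketch: the field names in `ALLOWED_FIELDS` and the `minimize` helper are hypothetical, chosen only for illustration.

```python
# Hypothetical allowlist: the only fields assumed to be needed for inference.
ALLOWED_FIELDS = {"ticket_id", "category", "message"}

def minimize(record: dict) -> tuple[dict, float]:
    """Return the allowlisted subset of record and the share of fields dropped."""
    kept = {k: v for k, v in record.items() if k in ALLOWED_FIELDS}
    dropped = len(record) - len(kept)
    drop_rate = dropped / len(record) if record else 0.0
    return kept, drop_rate

record = {
    "ticket_id": "T-1001",
    "category": "billing",
    "message": "I was charged twice.",
    "email": "user@example.com",      # not needed for inference, so never sent
    "address": "1 Example Street",    # not needed for inference, so never sent
}
payload, drop_rate = minimize(record)
print(payload)
print(f"dropped {drop_rate:.0%} of fields")
```

An allowlist fails closed: a field added to the record later is withheld until someone decides it is inference-essential, whereas a denylist would transmit it by default.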

🔍 Risk Classification

| Risk | Examples | Impact | Mitigation |
| --- | --- | --- | --- |
| Direct PII Transmission | Name, email, address | Leakage/re-output | Hashing/tokenization |
| Auth Secret Contamination | API keys, tokens | Authentication abuse | Secret detection + blocking |
| Code IP Leakage | Proprietary functions | Loss of competitive advantage | Partial masking + summarization |
| Internal ID Correlation | Sequential IDs/UUIDs transmitted in full | Inference attacks | Surrogate key conversion |
| Easy-to-reverse Hashing | Unsalted SHA1/MD5 | Re-identification | Salt + KDF |
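
The "Easy-to-reverse Hashing" row can be made concrete by comparing an unsalted digest with a salted KDF. The sketch below uses PBKDF2 from the standard library; the names `PSEUDONYM_SALT`, `weak_digest`, and `pseudonymize` and the iteration count are assumptions for illustration, and in practice the salt would come from a secrets manager rather than being generated per process.

```python
import hashlib
import os

# Assumption: a secret, per-deployment salt. Generated here only so the sketch runs.
PSEUDONYM_SALT = os.urandom(32)

def weak_digest(value: str) -> str:
    """Unsalted SHA-1: the same input always gives the same digest everywhere,
    so low-entropy values such as emails can be looked up in precomputed tables."""
    return hashlib.sha1(value.encode("utf-8")).hexdigest()

def pseudonymize(value: str, iterations: int = 100_000) -> str:
    """Salted PBKDF2-HMAC-SHA256: stable within one deployment,
    not precomputable without the salt, and slow to brute-force."""
    digest = hashlib.pbkdf2_hmac("sha256", value.encode("utf-8"), PSEUDONYM_SALT, iterations)
    return digest.hex()

print(weak_digest("alice@example.com"))
print(pseudonymize("alice@example.com"))
```

The same function could also serve as a surrogate key for internal IDs, since equal inputs map to equal outputs within one deployment while the raw ID never leaves it.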

🧱 Architecture Layers

(Client) -> (Ingress Filter) -> (Masking Pipeline) -> (Policy Gate) -> (LLM)
                               |-> (Detokenization Service - scoped)
- Ingress Filter: MIME/size/binary rejection
- Masking Pipeline: PII detection -> replacement -> token map storage
- Policy Gate: Verification of permissions/purpose/data classification compliance
- Detokenization: Least privilege + mandatory audit logs
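
The layers above could be wired together roughly as follows. This is a sketch under stated assumptions: the size limit, the MIME allowlist, the `Rejected` exception, and the `mask`, `policy_allows`, and `call_llm` callables are all hypothetical placeholders for the real components.

```python
from typing import Callable

MAX_BYTES = 64 * 1024                                               # hypothetical limit
ALLOWED_MIME = {"text/plain", "text/markdown", "application/json"}  # hypothetical allowlist

class Rejected(Exception):
    """Raised when a request must not reach the LLM."""

def ingress_filter(body: bytes, mime: str) -> str:
    """Reject unsupported, oversized, or binary payloads; return decoded text."""
    if mime not in ALLOWED_MIME:
        raise Rejected(f"unsupported MIME type: {mime}")
    if len(body) > MAX_BYTES:
        raise Rejected("payload too large")
    if b"\x00" in body:
        raise Rejected("binary content")
    try:
        return body.decode("utf-8")
    except UnicodeDecodeError as exc:
        raise Rejected("not valid UTF-8") from exc

def run_pipeline(
    body: bytes,
    mime: str,
    mask: Callable[[str], str],
    policy_allows: Callable[[str], bool],
    call_llm: Callable[[str], str],
) -> str:
    """Ingress Filter -> Masking Pipeline -> Policy Gate -> LLM."""
    text = ingress_filter(body, mime)
    masked = mask(text)
    if not policy_allows(masked):
        raise Rejected("blocked by policy gate")
    return call_llm(masked)
```

Passing the stages in as callables keeps each layer testable in isolation; only masked text ever reaches `call_llm`.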

🧪 PII/Secret Detection Rule Examples (Python)

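import re

# Regexes are heuristics: expect false positives (especially name_like) and misses.
PII_PATTERNS = {
    'email': re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    # (?<!\w) replaces a leading \b so that an optional '+' prefix is captured.
    'phone': re.compile(r"(?<!\w)\+?\d[\d -]{8,}\d\b"),
    'name_like': re.compile(r"\b[A-Z][a-z]+\s[A-Z][a-z]+\b"),
}
SECRET_PATTERNS = {
    'api_key': re.compile(r"sk-[A-Za-z0-9]{32,}"),
    'aws_key': re.compile(r"AKIA[0-9A-Z]{16}"),
}

def detect(text: str) -> list[dict]:
    """Return every PII/secret match with its label, span, and raw value."""
    findings = []
    for label, pat in {**PII_PATTERNS, **SECRET_PATTERNS}.items():
        for m in pat.finditer(text):
            # 'value' is the raw sensitive string: keep findings in memory, never log them.
            findings.append({'type': label, 'span': m.span(), 'value': m.group(0)})
    return findings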
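
Findings in the shape returned by `detect` still have to be turned into masked text. The following is one possible sketch; the `redact` helper and its `[TYPE]` placeholder format are assumptions, and the findings are written by hand here so the block runs on its own.

```python
def redact(text: str, findings: list[dict]) -> str:
    """Replace each finding's span with a [TYPE] placeholder, skipping overlaps."""
    out, cursor = [], 0
    # Process spans left to right; on equal starts, prefer the longer match.
    for f in sorted(findings, key=lambda f: (f["span"][0], -f["span"][1])):
        start, end = f["span"]
        if start < cursor:  # overlaps a span that was already replaced
            continue
        out.append(text[cursor:start])
        out.append(f"[{f['type'].upper()}]")
        cursor = end
    out.append(text[cursor:])
    return "".join(out)

text = "contact Jane Doe at jane.doe@example.com"
findings = [
    {"type": "name_like", "span": (8, 16), "value": "Jane Doe"},
    {"type": "email", "span": (20, 40), "value": "jane.doe@example.com"},
]
print(redact(text, findings))  # contact [NAME_LIKE] at [EMAIL]
```

Building the output from slices avoids the offset drift that occurs when replacements of different lengths are applied to the string one after another.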

🔐 Masking Strategies

| Strategy | Method | Re-identification Risk | Use Cases |
| --- | --- | --- | --- |
| Fixed Token | Placeholder replacement | Low | General conversation/summarization |
| Hash (SHA256 + Salt) | Display digest | Medium (low frequency) | Log traces |
| Format-Preserving Mask | Partial retention (**1234) | Medium | UX display |
| Attribute Generalization | Age → decade | Low | Statistics/analysis |
| Synthetic Data Replacement | Faker generation | Lowest | Testing/validation |
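
Four of the strategies in the table can be sketched with the standard library alone. The function names, the placeholder format, and the 12-character digest length are illustrative assumptions; synthetic data replacement is left out because it depends on a third-party generator.

```python
import hashlib
import hmac

def fixed_token(kind: str) -> str:
    """Fixed token: every value of a kind becomes the same placeholder."""
    return f"[{kind.upper()}]"

def salted_digest(value: str, salt: bytes, length: int = 12) -> str:
    """Keyed SHA-256 digest, truncated for display in log traces."""
    return hmac.new(salt, value.encode("utf-8"), hashlib.sha256).hexdigest()[:length]

def keep_last4(value: str) -> str:
    """Format-preserving mask: hide all digits except the last four.
    Values with four or fewer digits are masked completely."""
    total = sum(ch.isdigit() for ch in value)
    to_mask = total - 4 if total > 4 else total
    out = []
    for ch in value:
        if ch.isdigit() and to_mask > 0:
            out.append("*")
            to_mask -= 1
        else:
            out.append(ch)  # separators and the trailing digits stay in place
    return "".join(out)

def generalize_age(age: int) -> str:
    """Attribute generalization: exact age to decade bucket."""
    return f"{age // 10 * 10}s"

print(fixed_token("email"))               # [EMAIL]
print(keep_last4("4111-1111-1111-1234"))  # ****-****-****-1234
print(generalize_age(37))                 # 30s
```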

🧬 Dynamic Policy Example (YAML)

version: 1
rules:
  - id: deny_raw_secret
    match: secret
    action: block
  - id: pii_email
    match: email
    action: mask_token
  - id: pii_name
    match: name_like
    action: generalize
  - id: code_block
    match: code
    action: summarize

🔁 Bidirectional Tokenization

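import logging
import secrets

audit_log = logging.getLogger('detokenize.audit')
token_map = {}  # in-memory for illustration; production needs an encrypted, access-controlled store

def tokenize(value: str) -> str:
    """Replace value with a random token and remember the mapping."""
    token = f"TKN_{secrets.token_hex(8)}"
    token_map[token] = value
    return token

def detokenize(token: str, actor_role: str) -> str:
    """Return the original value; only the 'auditor' role is allowed."""
    allowed = actor_role == 'auditor'
    # Audit every attempt (token and role only, never the original value).
    audit_log.info('detokenize token=%s role=%s allowed=%s', token, actor_role, allowed)
    if not allowed:
        raise PermissionError('not allowed')
    return token_map.get(token, '')  # unknown tokens yield an empty string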
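
Tokenization only helps if the original values do not reappear in model output. One way to check is to scan each response against the token map before it is returned. This is a sketch: `find_reexposures` and `enforce_no_reexposure` are hypothetical helpers, and verbatim case-insensitive matching is an assumed baseline that will not catch paraphrased or reformatted values.

```python
def find_reexposures(output: str, token_map: dict[str, str]) -> list[str]:
    """Return the tokens whose original value appears verbatim in the output."""
    lowered = output.casefold()
    return [
        token
        for token, original in token_map.items()
        if original and original.casefold() in lowered
    ]

def enforce_no_reexposure(output: str, token_map: dict[str, str]) -> str:
    """Raise if any tokenized value leaked into the output; otherwise pass it through."""
    leaked = find_reexposures(output, token_map)
    if leaked:
        # Report tokens only, so the error message itself does not leak values.
        raise ValueError(f"re-exposure detected for tokens: {leaked}")
    return output

example_map = {"TKN_aaaa": "jane.doe@example.com", "TKN_bbbb": "Jane Doe"}
print(find_reexposures("Reply sent to TKN_aaaa.", example_map))               # []
print(find_reexposures("Reply sent to JANE.DOE@example.com.", example_map))   # ['TKN_aaaa']
```

A check like this is also a natural regression test for the re-output prohibition rules.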

✅ Validation Metrics

| Metric | Measurement Method | Target |
| --- | --- | --- |
| Redaction Rate | % of detected PII with masking applied | > 98% |
| False Positive Rate | Manual sample re-annotation | < 5% |
| Reverse Lookup Success Rate | Rainbow table attack test | < 1% |
| Secret Leakage Re-output Count | Audit log aggregation | 0 |
| Latency Overhead | p95 processing time comparison | < +30 ms |
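
Two of these metrics are simple to compute from counters and timing samples. The sketch below assumes latency samples are collected in milliseconds with and without the masking pipeline; the function names are illustrative.

```python
import statistics

def redaction_rate(detected: int, masked: int) -> float:
    """Share of detected PII items that were actually masked."""
    return masked / detected if detected else 1.0

def p95(samples_ms: list[float]) -> float:
    """95th percentile: quantiles(n=20) returns 19 cut points, the last is p95."""
    return statistics.quantiles(samples_ms, n=20)[-1]

def latency_overhead_ms(baseline_ms: list[float], masked_ms: list[float]) -> float:
    """Difference in p95 latency between the masked and baseline paths."""
    return p95(masked_ms) - p95(baseline_ms)

baseline = [float(x) for x in range(1, 101)]
with_masking = [x + 10.0 for x in baseline]   # synthetic: every request 10 ms slower
print(f"redaction rate: {redaction_rate(200, 197):.1%}")
print(f"p95 overhead: {latency_overhead_ms(baseline, with_masking):.1f} ms")
```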

🚀 Implementation Steps

  1. Inventory current logs/input fields & classify sensitivity
  2. PII + Secret detection PoC (measure recall/precision)
  3. Tokenization + re-output prohibition rules (add regression tests)
  4. Detokenization API with least privilege + audit logging
  5. Gradual production rollout (Shadow → Enforce)
  6. Continuous evaluation: metrics dashboard & monthly reviews
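
The Shadow → Enforce rollout in step 5 can be sketched as a mode switch in front of the LLM call. The `Mode` enum, the `gate` function, and the logger name are assumptions for illustration: in shadow mode the masking result is only observed, and in enforce mode it replaces the input.

```python
import logging
from enum import Enum

log = logging.getLogger("masking.rollout")

class Mode(Enum):
    SHADOW = "shadow"    # measure what masking would do, forward the original
    ENFORCE = "enforce"  # forward only the masked text

def gate(text: str, masked_text: str, mode: Mode) -> str:
    """Return the text to forward to the LLM under the given rollout mode."""
    if masked_text != text:
        # Log the fact of a change, never the values themselves.
        log.info("masking changed input (mode=%s)", mode.value)
    if mode is Mode.ENFORCE:
        return masked_text
    return text

print(gate("mail a@b.com", "mail [EMAIL]", Mode.SHADOW))   # mail a@b.com
print(gate("mail a@b.com", "mail [EMAIL]", Mode.ENFORCE))  # mail [EMAIL]
```

Running in shadow mode first lets the false positive rate and latency overhead be measured on real traffic before masking can affect responses.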

Related:
  • Source Code Leak Prevention: ./source-code-leak-prevention.md
  • Audit Logging: ./audit-logging.md
  • Prompt Injection: ./prompt-injection.md

Back: ./index.md