
The Essence of Prompt Engineering — From Technique Collections to Design Thinking

Key Point: Move from stumbling onto a working prompt by tweaking words to reproducible design. This article shows readers who already understand prompt techniques how to take the next step: problem definition, context design, output contracts, and lightweight validation.

Introduction: Why Graduate from "Technique Collections" (With Concrete Examples)

Recently popular prompt techniques such as "drunk role-playing" or "deep thinking" feel effective and do work in the short term. However, they often change the "performance" (tone and style) rather than deliver "essential improvements."

Limitations of Techniques

Instructing the model to "pretend to be drunk" produces a casual tone that seems "honest." But the model is imitating a drunk speaker, not revealing hidden truth. The quality of the facts and evidence hasn't improved.

Writing "think slowly" or "step-by-step" increases the volume of reasoning. However, lengthy explanations of wrong answers are still wrong. The "performance" of reasoning gets thicker, but without factual evidence or evaluation criteria, accuracy isn't guaranteed.

The tone improves, but with empty judgment criteria, it just becomes "sounding authoritative."

Conclusion: What readers and businesses require isn't "sounding authoritative" but answers grounded in correct criteria, evidence, and reproducibility. We should therefore shift effort from prompt wording to design (problem definition, information selection and arrangement, output contracts, lightweight validation). The rest of this article focuses on these four pillars.

1. Why "Prompt Technique Collection" Dependence Plateaus

Techniques are useful as switches to change perspective or tone. However, when data selection, evaluation criteria, and evidence formats remain undesigned, you hit these limits:

Limitation 1: No reproducibility. Model updates, temperature, and where an instruction sits in a long text can all make the same instruction produce different answers. What works in the moment cannot be reproduced across projects, people, or time.

Limitation 2: Accuracy gains are fragile without design. Techniques like "step-by-step," "Deep Thinking," and "critically review" can widen the model's reasoning and raise average accuracy. However, unless evidence sources, reference boundaries, and pass/fail criteria are designed, these techniques easily end up lending confidence to errors (lengthy, polite explanations of "wrong" answers).

Limitation 3: Cannot withstand accountability. Business requires an answer to "Why can you say that?" Answers without sources, timestamps, and IDs get stopped at the review and audit stages.

Summary: Techniques are "tools for perspective change", not the main actor determining quality. Quality is determined by design.

2. The Remaining "Core" — Perspective, Boundaries, Evaluation (Why They Remain, Logically)

These three cores remain because they define the premises for decisions, not the model's "behavior."

2.1 Perspective — Which judgment axis to compare by

Models tend to average toward "generally useful" answers. Unless viewpoints (evaluation axes) are declared up front, what to prioritize stays ambiguous and answers fluctuate.

What happens: The same facts get weighted differently by legal perspective (risk minimization), financial perspective (cost minimization), operational perspective (ease of operation).

Without declared perspectives:

Question: "Which overseas vendor annual contract should we choose?"

Answer: "Plan B is generally reasonable" (scattered reasoning)

With declared perspectives (legal, finance, operations):

Question: "Which overseas vendor annual contract should we choose?"

Answer: "Legal: Data transfer clause ✗ / Finance: Total cost ✓ / Operations: SLA △. The legal failure rules out B. Recommend alternative C"

Implementation Core: Specify by viewpoint labels rather than role names (e.g., [Perspectives] legal, finance, operations).
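One way to read the second answer is as a verdict per perspective plus a veto rule. The following is a minimal sketch, not a prescribed implementation: the names `Verdict`, `build_perspective_block`, and `recommend` are hypothetical, the verdicts for plan C are invented for illustration, and the rule that a legal failure blocks a plan is an assumption taken from the example.

```python
from enum import Enum


class Verdict(Enum):
    PASS = "pass"        # shown as ✓ in the example
    PARTIAL = "partial"  # shown as △
    FAIL = "fail"        # shown as ✗


def build_perspective_block(perspectives):
    """Render viewpoint labels as a single prompt preamble line."""
    return "[Perspectives] " + ", ".join(perspectives)


def recommend(plans, blocking=("legal",)):
    """Return plan names with no FAIL on a blocking perspective,
    ordered by how many perspectives they pass (most first)."""
    eligible = {
        name: verdicts
        for name, verdicts in plans.items()
        if all(verdicts.get(p) != Verdict.FAIL for p in blocking)
    }
    return sorted(
        eligible,
        key=lambda name: sum(v == Verdict.PASS for v in eligible[name].values()),
        reverse=True,
    )


# Hypothetical verdicts; plan B mirrors the example, plan C is made up.
plans = {
    "B": {"legal": Verdict.FAIL, "finance": Verdict.PASS, "operations": Verdict.PARTIAL},
    "C": {"legal": Verdict.PASS, "finance": Verdict.PARTIAL, "operations": Verdict.PASS},
}

print(build_perspective_block(["legal", "finance", "operations"]))
print(recommend(plans))
```

Keeping the veto rule in code rather than in prompt wording means the same decision logic applies no matter how the model phrases its per-perspective findings.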

2.2 Boundaries (Scope/Constraint) — What to include, what to discard

Even in long contexts, unimportant fragments get "plausibly" mixed in. Restricting what the model can see by source, timestamp, and authority reduces wrong answers.

What happens:

  • Limit sources to internal announcements v2025-07 onwards → Old wikis and external blogs are excluded from responses
  • Set timestamp to "after latest update date" → Avoid mistakenly adopting old system values
  • Restrict authority to "procurement portal public documents only" → Prevent accidental citation of internal memos

Without boundaries: "Last year's internal blog mentioned a $50k limit..."

With boundaries: "Procurement announcement v2025-07 §4 limit $35k, exceptions in §4.2 (large contracts)"

Implementation Core: [Sources] include: policy_portal>=2025-07; exclude: wiki, blogs / Re-state important facts at beginning and end (mid-text burial countermeasure).
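The include/exclude rule and the restatement countermeasure can be sketched as a filter applied before anything reaches the model. This is an illustrative sketch under assumptions: the `Doc` fields, the function names, and the sample documents are hypothetical, and versions are assumed to be "YYYY-MM" strings so that plain string comparison matches date order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    doc_id: str
    source: str   # e.g. "policy_portal", "wiki", "blogs"
    version: str  # "YYYY-MM", so string comparison follows date order
    text: str


def within_boundaries(doc, include=("policy_portal",),
                      exclude=("wiki", "blogs"), min_version="2025-07"):
    """True if the document may be shown to the model."""
    if doc.source in exclude:
        return False
    return doc.source in include and doc.version >= min_version


def build_context(docs, key_fact):
    """Keep only in-bounds documents and restate the key fact at both ends."""
    allowed = [d for d in docs if within_boundaries(d)]
    body = "\n".join(f"[{d.doc_id}] {d.text}" for d in allowed)
    return f"KEY FACT: {key_fact}\n{body}\nKEY FACT (restated): {key_fact}"


# Hypothetical documents echoing the procurement example.
docs = [
    Doc("blog-2024", "blogs", "2024-09", "Limit is $50k."),
    Doc("proc-2025-07", "policy_portal", "2025-07", "§4 limit $35k; exceptions in §4.2."),
    Doc("proc-2024-01", "policy_portal", "2024-01", "Limit is $50k."),
]
context = build_context(docs, "Limit is $35k (proc-2025-07 §4).")
print(context)
```

Only the 2025-07 announcement survives the filter, so the outdated $50k figure never enters the context.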

2.3 Evaluation — What to consider "correct"

Models maximize "plausibility." Without explicit pass/fail criteria, even lengthy explanations with errors tend to be treated as passing.

What happens:

  • No criteria → "Detailed explanations" increase but wrong evidence cannot be correctly identified
  • With criteria → Monitor success rate/critical error rate/evidence rate/editing time and immediately roll back to the previous version when they deteriorate
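The monitoring and rollback rule can be made concrete with a few lines of bookkeeping. The sketch below is one possible shape, assuming a hypothetical `RunResult` record per answer and an arbitrary tolerance of 0.05; real thresholds depend on the task.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    passed: bool           # met the pass/fail criteria
    critical_error: bool   # e.g. wrong limit or wrong clause
    has_evidence: bool     # cited at least one valid evidence ID
    edit_seconds: float    # human editing time before release


def summarize(results):
    """Compute the four metrics named in the text for a batch of results."""
    n = len(results)
    if n == 0:
        raise ValueError("no results to summarize")
    return {
        "success_rate": sum(r.passed for r in results) / n,
        "critical_error_rate": sum(r.critical_error for r in results) / n,
        "evidence_rate": sum(r.has_evidence for r in results) / n,
        "avg_edit_seconds": sum(r.edit_seconds for r in results) / n,
    }


def should_rollback(current, baseline, tolerance=0.05):
    """Roll back when quality drops, or critical errors rise, by more than `tolerance`."""
    return (
        current["success_rate"] < baseline["success_rate"] - tolerance
        or current["evidence_rate"] < baseline["evidence_rate"] - tolerance
        or current["critical_error_rate"] > baseline["critical_error_rate"] + tolerance
    )
```

A typical use would be to compute `summarize` over a fixed evaluation set for both the previous and the new prompt version, then call `should_rollback(new, previous)` before release.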

Output Contract Framework:

[Output Contract]
- Format: Table (item/value/evidence ID/update date)
- Evidence: Always include "announcement ID + clause number + date"
- Prohibited: Speculative assertions/unverified external information

Automated Evaluation (Minimal Example)

  • Format verification: Check for required keys (e.g., title, decision, evidence_ids[]) with JSON Schema/regex
  • Evidence verification: Cross-reference that evidence_ids actually exist in permitted sources within boundaries (detect 404/expired/version mismatches)
  • Pass/fail decision: Outputs that break a rule are marked as fail and regenerated. Stop after n attempts and escalate to human review
  • Audit log: Save model_version / seed / temperature / source_hash (reproducibility, accountability)
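The four checks above fit in a small validation loop. What follows is a minimal sketch using only the standard library: a plain key check stands in for a full JSON Schema validation, the evidence ID format and the `fake_generate` stub are hypothetical, and `generate` is assumed to be any callable that returns the model's raw text.

```python
import json

REQUIRED_KEYS = {"title", "decision", "evidence_ids"}


def validate(raw, permitted_ids):
    """Return a list of rule violations; an empty list means pass."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    if not isinstance(data, dict):
        return ["top level is not an object"]
    errors = [f"missing key: {k}" for k in sorted(REQUIRED_KEYS - set(data))]
    ids = data.get("evidence_ids", [])
    if not isinstance(ids, list) or not ids:
        errors.append("evidence_ids must be a non-empty list")
    else:
        errors += [f"unknown evidence id: {i}" for i in ids
                   if not isinstance(i, str) or i not in permitted_ids]
    return errors


def generate_with_checks(generate, permitted_ids, max_attempts=3,
                         audit_log=None, run_meta=None):
    """Call generate(attempt) until the output validates or attempts run out.

    Returns (output, "pass") or (None, "escalate_to_human").
    """
    for attempt in range(1, max_attempts + 1):
        raw = generate(attempt)
        errors = validate(raw, permitted_ids)
        if audit_log is not None:
            # run_meta carries model_version / seed / temperature / source_hash
            audit_log.append({**(run_meta or {}), "attempt": attempt, "errors": errors})
        if not errors:
            return raw, "pass"
    return None, "escalate_to_human"


def fake_generate(attempt):
    """Stand-in for a model call: fails the contract once, then complies."""
    if attempt == 1:
        return '{"title": "Limit", "decision": "approve"}'
    return '{"title": "Limit", "decision": "approve", "evidence_ids": ["proc-2025-07#4"]}'


log = []
output, status = generate_with_checks(
    fake_generate, {"proc-2025-07#4"}, audit_log=log,
    run_meta={"model_version": "hypothetical-model-v1", "seed": 1234, "temperature": 0.0},
)
print(status, len(log))
```

The validator never inspects how persuasive the answer sounds; it only checks the contract, which is what keeps lengthy but wrong explanations from passing.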

Implementation Core: Evaluation first, generation second. At minimum, measure "success rate, critical errors, editing time, evidence rate."

Perspective = judgment axis, Boundaries = information safety fence, Evaluation = pass/fail decision. Even as models get smarter, "what to consider good" remains a human design domain.

3. "Smart Usage" Points for the Long-Text Era (Even with Large Context)

Long-Text Response Points:

  • Don't just load everything in and stop there: In long texts the beginning and end tend to register strongly and the middle weakly. Rescue important facts by restating and marking them.
  • Componentize evidence: Build a fact collection of short sentences, each with an ID and timestamp, and have the main text cite facts by ID.
  • Work in stages: Search→Extract→Summarize→Contract (output/evidence)→Generate. Staged runs are more stable than a single-shot request.
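The "componentize evidence" point can be sketched as a small fact sheet plus a citation check. This is an illustration under assumptions: the `Fact` record, the `F<number>` ID scheme, and the sample facts are all hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    fact_id: str    # assumed scheme: "F" followed by digits
    text: str       # one short sentence
    timestamp: str  # ISO date


def render_fact_sheet(facts):
    """One short line per fact, so a draft can cite facts by ID."""
    return "\n".join(f"[{f.fact_id}] ({f.timestamp}) {f.text}" for f in facts)


def cited_ids(draft):
    """IDs referenced in the draft as [F1], [F2], ..."""
    return set(re.findall(r"\[(F\d+)\]", draft))


def unknown_citations(draft, facts):
    """Citations in the draft that match no fact in the collection."""
    return cited_ids(draft) - {f.fact_id for f in facts}


# Hypothetical facts and draft.
facts = [
    Fact("F1", "The purchase limit is $35k.", "2025-07-01"),
    Fact("F2", "Exceptions for large contracts are in §4.2.", "2025-07-01"),
]
draft = "The limit is $35k [F1]; approvals take three days [F9]."

print(render_fact_sheet(facts))
print(unknown_citations(draft, facts))
```

A citation to an ID that is not in the collection, like `F9` here, is a cheap signal that the draft contains an unsupported claim.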

Context expansion is beneficial, but selection, arrangement, evidence formats remain human design domains.

Appendix A: Consistency and Diversity Design — Combining temperature/seed with "Perspective, Boundaries, Evaluation"

Why do outputs fluctuate with the same prompt? Most LLMs choose the next token by probabilistic sampling (temperature, nucleus/top-p, etc.). This "fluctuation" helps expand ideas in creative tasks, but consistency matters more for product reviews and regulatory responses.
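To see why these knobs change variability, it helps to look at the arithmetic. The sketch below shows the standard temperature-scaled softmax and a simple nucleus (top-p) cut on made-up logits for three tokens. It illustrates the general mechanism only, not any specific vendor's sampler.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; treat 0 as greedy argmax instead")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose mass reaches top_p,
    then renormalize. Returns {token_index: probability}."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return {i: probs[i] / mass for i in kept}


logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
warm = softmax_with_temperature(logits, 1.0)
cold = softmax_with_temperature(logits, 0.2)

print([round(p, 3) for p in warm])
print([round(p, 3) for p in cold])
print(top_p_filter(warm, 0.8))
```

At the lower temperature almost all probability mass sits on the top token, so repeated runs rarely diverge. The top-p cut removes the unlikely tail before sampling.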

Control Knobs (Minimum Set)

Generation Parameters:

  • temperature: 0.0–0.3 for deterministic, 0.7–1.0 for diverse
  • top_p: 1.0 for broad, 0.8 etc. to trim tails
  • seed: Effective for pseudo-reproduction with same model version, input, parameters (not fully guaranteed)
  • n/candidates: Generate multiple candidates, select by evaluation criteria
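The n/candidates knob only pays off when selection is done by an explicit criterion. Below is a minimal sketch, assuming a toy scoring rule that counts bracketed evidence references; `select_best` and `evidence_count` are hypothetical names, and a real scorer would apply the full evaluation criteria.

```python
import re


def evidence_count(text):
    """Toy criterion: number of bracketed evidence references in the text."""
    return len(re.findall(r"\[[^\]]+\]", text))


def select_best(candidates, score, min_score=None):
    """Return the highest-scoring candidate, or None if none reaches min_score."""
    if not candidates:
        return None
    best = max(candidates, key=score)
    if min_score is not None and score(best) < min_score:
        return None
    return best


# Hypothetical candidates, as if generated with n=3.
candidates = [
    "The limit is $35k.",
    "The limit is $35k [proc-2025-07 §4].",
    "The limit is $35k [proc-2025-07 §4], with exceptions in [proc-2025-07 §4.2].",
]

print(select_best(candidates, evidence_count, min_score=1))
```

Returning `None` when no candidate clears the bar leaves room to regenerate or escalate rather than accept the least bad answer.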

Design Parameters (Quality-determining preambles):

  • Perspective: Declare scoring viewpoints (e.g., legal/finance/ops) by labels
  • Boundaries: Specify reference source inclusion/exclusion, timestamps, authority
  • Evaluation (Output Contract): Fix format, evidence format, prohibited items in 3 lines

Configuration by Use Case

| Purpose | Recommended Generation Knobs | Design Knobs (Required) | Notes |
|---|---|---|---|
| Creative (novels, ideation) | temp=0.8–1.0; top_p≈1.0; seed unspecified; n=3–5 | Perspective: style/audience only | Prioritize diversity. Selection by human/separate model. |
| AI Review (spec checking) | temp=0.0–0.2; top_p=0.8–1.0; seed fixed; n=1 | Perspective: legal/qa/ux; Boundaries: official docs≥date; Evaluation: pass/fail+evidence ID | Same criteria every time for judgment. Fix and record model/parameter versions. |
| FAQ Production Answers | temp=0.0–0.3; top_p≤0.9; seed fixed; n=1 | Boundaries: announcement ID/version; Evaluation: table format+update date | Avoid time/external API dependencies. Cache/audit logs required. |
| Analysis Draft (template) | temp=0.3–0.6; n=2–3 | Perspective: management/field; Evaluation: outline contract | Human finishing required. Always keep evidence IDs. |
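These configurations can be kept as named presets so that nobody retypes knob values by hand. The sketch below is one possible encoding: each value is a single pick from within the ranges listed above, the seed 1234 is arbitrary, and the parameter names follow this article rather than any particular API, so they may need mapping to your provider's fields.

```python
# Generation knobs per use case (picks from within the recommended ranges).
PRESETS = {
    "creative": {"temperature": 0.9, "top_p": 1.0, "seed": None, "n": 4},
    "ai_review": {"temperature": 0.0, "top_p": 1.0, "seed": 1234, "n": 1},
    "faq": {"temperature": 0.2, "top_p": 0.9, "seed": 1234, "n": 1},
    "analysis_draft": {"temperature": 0.5, "n": 2},
}

# Design knobs that must be supplied for each use case.
REQUIRED_DESIGN = {
    "creative": {"perspective"},
    "ai_review": {"perspective", "boundaries", "evaluation"},
    "faq": {"boundaries", "evaluation"},
    "analysis_draft": {"perspective", "evaluation"},
}


def build_request(use_case, design):
    """Combine a preset with design knobs; refuse if a required knob is missing."""
    missing = REQUIRED_DESIGN[use_case] - set(design)
    if missing:
        raise ValueError(f"{use_case} requires design knobs: {sorted(missing)}")
    # Drop unset values (e.g. an unspecified seed) instead of sending None.
    params = {k: v for k, v in PRESETS[use_case].items() if v is not None}
    preamble = "\n".join(f"[{k.capitalize()}] {v}" for k, v in design.items())
    return {"params": params, "preamble": preamble}


request = build_request("faq", {
    "boundaries": "announcement ID/version",
    "evaluation": "table format + update date",
})
print(request["params"])
print(request["preamble"])
```

Raising an error when a required design knob is missing enforces the "Required" column mechanically instead of relying on memory.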

Conclusion: Adjust "fluctuation" with the generation knobs (temperature/top_p/seed) and fix "what to consider good" with the design knobs (perspective, boundaries, evaluation). That lets you aim for diversity in creative work and consistency in practical work.

4. Summary

"Magic words" are useful as an entry point but are not the destination. What to polish next:

Next Steps — Four Pillars:

  1. Problem Definition (what, for whom)
  2. Context Design (selection, arrangement, evidence IDs)
  3. Output Contracts (format, evidence, prohibitions)
  4. Lightweight Validation (success/critical/editing time)

If you can state these four points as bullets, their value won't degrade even when models and tools change. Prompts become "componentized," and you become a designer. This is the next step.