skill-creator New Version - How Eval Pipelines Transform Skill Development
For / Key Points
For: Engineers developing and operating Skills with Claude Code / Claude.ai. Assumes familiarity with the previous skill-creator complete guide.
Key Points:
- skill-creator evolved into a full-cycle tool: create → evaluate → improve → benchmark
- Added subagent parallel execution, browser-based reviewer, and trigger optimization loops
- Driven by the industry-wide realization that "unverified context is worthless"
What Changed: Old vs. New Architecture
In short, the old version focused on "helping you write Skills." The new version extends to "verifying whether your Skill actually works, then improving it."
The skill-creator introduced in the previous article (January 2026) was a tool that auto-generated SKILL.md and directory structures through interactive dialogue. It included three helper scripts: init_skill.py (initialization), quick_validate.py (validation), and package_skill.py (packaging) [1]. Post-generation quality assurance was left to manual testing and the user's own judgment.
The new version fundamentally changes this structure. skill-creator's role expanded from "create Skills" to "create, test, measure, and improve Skills."
| Aspect | Old (Jan 2026) | New (Mar 2026) |
|---|---|---|
| Operation modes | Create only | Create / Eval / Improve / Benchmark (4 modes) |
| Test execution | Manual | Subagent parallel execution (with-skill / baseline) |
| Evaluation method | User's subjective judgment | Quantitative (assertion auto-grading) + Qualitative (browser viewer) |
| Specialized agents | None | Grader / Comparator / Analyzer (3 types) [2] |
| Trigger optimization | None | Description Optimization (train/test split, max 5 iterations) |
| Script count | 3 | 8 (aggregate_benchmark.py, run_loop.py, run_eval.py, etc.) |
| Multi-environment | Claude Code only | Explicit branching for Claude Code / Claude.ai / Cowork |
The eval-viewer deserves special attention. The HTML viewer generated by eval-viewer/generate_review.py lets you review outputs per test case in a browser and enter feedback. It includes side-by-side comparison with previous iteration outputs, plus a benchmark tab for quantitative comparison of pass rates, token consumption, and execution time.
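To make the benchmark tab's three metrics concrete, here is a minimal sketch of how per-run results could be averaged by configuration. The `RunResult` fields, the `summarize` function, and the sample numbers are all hypothetical illustrations; this is not the logic or the data format of aggregate_benchmark.py.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RunResult:
    """One eval run. Every field name here is a hypothetical illustration."""
    config: str      # "with_skill" or "baseline"
    passed: int      # assertions that passed
    total: int       # assertions checked
    tokens: int      # tokens consumed by the run
    seconds: float   # wall-clock execution time


def summarize(runs: list[RunResult]) -> dict[str, dict[str, float]]:
    """Group runs by configuration and aggregate the three benchmark metrics."""
    summary: dict[str, dict[str, float]] = {}
    for config in sorted({r.config for r in runs}):
        group = [r for r in runs if r.config == config]
        summary[config] = {
            # Pooled pass rate: all passed assertions over all checked assertions.
            "pass_rate": sum(r.passed for r in group) / sum(r.total for r in group),
            "mean_tokens": mean(r.tokens for r in group),
            "mean_seconds": mean(r.seconds for r in group),
        }
    return summary


# Made-up sample data, for illustration only.
runs = [
    RunResult("with_skill", 4, 4, 1200, 30.0),
    RunResult("with_skill", 3, 4, 1000, 26.0),
    RunResult("baseline", 2, 4, 800, 20.0),
    RunResult("baseline", 2, 4, 600, 18.0),
]
report = summarize(runs)
for config, metrics in report.items():
    print(config, metrics)
```

Reading pass rate next to tokens and time matters because a Skill that raises the pass rate may also raise cost, and the trade-off is only visible when all three sit side by side.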
Why It Changed: The Unverified Context Problem
This evolution has an industry-wide backdrop.
Research from ETH Zurich (published February 2026) [3] reported that developer-written context files improved agent task completion rates by only 4%. LLM-auto-generated context actually worsened performance by 3%, and both increased costs by over 20%.
The lesson isn't "don't write context." It's that "unverified context doesn't help." Skills face the same structural problem. Writing a SKILL.md that "seems to work" doesn't tell you whether the Skill is actually adding value.
Anthropic hasn't officially stated its motivation for embedding an eval pipeline into skill-creator. However, as Skills multiply and the ecosystem expands, cases where "using a Skill made things worse" become inevitable without tools to measure quality. This concern aligns with the challenges the research above identified.
Reading the New Version's Design Philosophy
The SKILL.md text includes not just eval procedures but detailed thinking about Skill improvement. A design philosophy absent from the previous skill-creator emerges here.
The "Explain WHY" Principle
The new version explicitly discourages littering Skills with uppercase ALWAYS and NEVER directives. Instead, it recommends explaining why each instruction matters. When the reasoning is clear, the LLM can make appropriate judgments beyond the literal instruction. Rigid rule lists break down in unexpected cases.
The "Generalize, Don't Overfit" Principle
Test cases use only a few examples, but a Skill may be executed millions of times. Over-optimizing for test results produces Skills that only work for specific cases. The new SKILL.md prompts you to always ask: "Can this feedback be generalized?"
Overfitting Prevention in Description Optimization
The trigger accuracy optimization loop applies the same idea: it splits the eval set into train 60% / test 40% and selects the best description by its score on the held-out test set. The train/test split concept from machine learning has been directly imported into Skill development.
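The 60/40 split and selection by test score can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual loop in run_loop.py: the sample queries, the candidate descriptions, and `toy_trigger` (a word-overlap stand-in for asking the model whether a Skill should fire) are all hypothetical.

```python
import random

# (query, should_trigger) pairs. Hypothetical sample data.
EVAL_SET = [
    ("merge two pdf reports", True),
    ("rotate this pdf", True),
    ("extract tables from a pdf", True),
    ("split a pdf by page", True),
    ("compress my pdf", True),
    ("weather in paris today", False),
    ("weather tomorrow", False),
    ("will the weather be sunny", False),
    ("weather alerts near me", False),
    ("weekend weather outlook", False),
]


def split_eval_set(examples, train_fraction=0.6, seed=0):
    """Shuffle deterministically, then cut into train and held-out test sets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def toy_trigger(description, query):
    """Stand-in for the model's decision: fire if the two texts share a word."""
    return bool(set(description.lower().split()) & set(query.lower().split()))


def accuracy(description, examples):
    """Fraction of examples where the trigger decision matches the label."""
    hits = sum(toy_trigger(description, q) == label for q, label in examples)
    return hits / len(examples)


# In a real loop, failures on `train` would guide each description rewrite;
# `test` is never shown to the rewriting step and is used only for selection.
train, test = split_eval_set(EVAL_SET)

# Stand-ins for the descriptions produced across iterations (the article cites a max of 5).
candidates = ["weather forecasts", "pdf documents", "spreadsheet formulas"]
best = max(candidates, key=lambda d: accuracy(d, test))
print(best, accuracy(best, test))
```

Selecting on the held-out 40% means a description that merely memorizes the phrasing of the training queries gains no advantage over one that generalizes.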
These are more than technical procedures: they declare a philosophy of treating Skills as software.
Practical Impact: Who Benefits and How
The new version's benefits vary by environment.
Claude Code benefits the most. Subagent parallel testing, browser viewer, and Description Optimization are all available. For teams developing and distributing Skills, the eval system becomes a quality gate.
Claude.ai has constraints. Without subagents, tests run sequentially, and baseline comparison and blind A/B comparison are skipped. However, test case definition and qualitative review flows are available, enabling a more structured improvement loop than manual testing.
Cowork supports subagents but lacks browser display. eval-viewer handles this with static HTML output (the --static option).
| Environment | Parallel testing | Browser viewer | Description optimization | Blind comparison |
|---|---|---|---|---|
| Claude Code | Supported | Supported | Supported | Supported |
| Claude.ai | Not supported | Not supported | Not supported | Not supported |
| Cowork | Supported | Static HTML | Supported | Supported |
The new version's direction is important, but the implementation still has rough edges. For example, a GitHub Issue reports that run_eval.py fails to trigger Skills via claude -p, resulting in a 0% trigger rate [4]. Meanwhile, the issue where the Description Optimizer required ANTHROPIC_API_KEY has been resolved by migrating to claude -p subprocesses [5], which shows that the surrounding tooling is being actively fixed.
Updating existing Skills is also supported. When you pass an existing Skill to skill-creator, it takes a snapshot and runs comparison tests using the old version as a baseline. You can quantitatively verify: "Did this get better than the previous version?"
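A minimal sketch of what that old-versus-new check can look like at the assertion level. The assertion names and results are hypothetical, and `compare_versions` is an illustrative helper, not part of skill-creator.

```python
def compare_versions(old: dict[str, bool], new: dict[str, bool]) -> dict[str, list[str]]:
    """Classify each shared assertion by how its result changed between versions."""
    shared = sorted(old.keys() & new.keys())
    return {
        "fixed": [k for k in shared if not old[k] and new[k]],
        "regressed": [k for k in shared if old[k] and not new[k]],
        "unchanged": [k for k in shared if old[k] == new[k]],
    }


def pass_rate(results: dict[str, bool]) -> float:
    """Share of assertions that passed."""
    return sum(results.values()) / len(results)


# Hypothetical assertion results for the snapshot (old) and the edited Skill (new).
old = {"has_title": True, "valid_json": False, "cites_sources": True, "under_500_words": False}
new = {"has_title": True, "valid_json": True, "cites_sources": False, "under_500_words": False}

diff = compare_versions(old, new)
print(pass_rate(old), pass_rate(new))
print(diff)
```

In this made-up example both versions pass half of the assertions, yet one assertion was fixed and another regressed. An aggregate pass rate alone would hide that, which is one reason to keep the per-assertion view alongside the headline number.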
Summary
The new skill-creator brought "measurement and improvement loops" to Skill development. It now covers not just creation but also testing, measuring, and optimizing.
The basic structure introduced in the previous article — SKILL.md, directory layout, Progressive Disclosure, design principles — remains unchanged. What the new version added is a quality assurance layer on top.
As Skills transition from "write-and-forget" prompts to testable software artifacts, this shift may ripple across AI agent development practices as a whole.
Related Articles
- skill-creator Complete Guide (Previous Article) — Basic structure, design principles, and usage
- Claude Skills Complete Guide — Comprehensive Skills overview
- Claude Code Complete Guide — Claude Code fundamentals
References:
- anthropics/skills (GitHub) — skill-creator source code
- Agent Skills Official Documentation — Anthropic official reference
[1] The old version's structure and each script's role are explained in the previous article.
[2] agents/grader.md (assertion grading), agents/comparator.md (blind A/B comparison), and agents/analyzer.md (statistical pattern analysis) are bundled in the Skill directory.
[3] arXiv:2602.11988: ETH Zurich empirical study on the effectiveness of context files.
[4] Issue #556: run_eval.py 0% trigger rate issue (open as of March 7, 2026).
[5] Commit b0cbd3d: improve_description.py migrated from the Anthropic SDK to claude -p subprocess (PR #547).