
Claude Code Complete Guide

skill-creator New Version - How Eval Pipelines Transform Skill Development

Audience / Key Points

Audience: Engineers developing and operating Skills with Claude Code / Claude.ai. Assumes familiarity with the previous skill-creator complete guide.

Key Points:

  • skill-creator evolved into a full-cycle tool: create → evaluate → improve → benchmark
  • Added subagent parallel execution, browser-based reviewer, and trigger optimization loops
  • Consistent with a growing industry view that "unverified context is worthless"

What Changed: Old vs. New Architecture

In short, the old version focused on "helping you write Skills." The new version extends to "verifying whether your Skill actually works, then improving it."

The skill-creator introduced in the previous article (January 2026) was a tool that auto-generated SKILL.md and directory structures through interactive dialogue. It included three helper scripts: init_skill.py (initialization), quick_validate.py (validation), and package_skill.py (packaging) [1]. Post-generation quality assurance was left to manual testing and the user's own judgment.

The new version fundamentally changes this structure. skill-creator's role expanded from "create Skills" to "create, test, measure, and improve Skills."

| Aspect | Old (Jan 2026) | New (Mar 2026) |
| --- | --- | --- |
| Operation modes | Create only | Create / Eval / Improve / Benchmark (4 modes) |
| Test execution | Manual | Subagent parallel execution (with-skill / baseline) |
| Evaluation method | User's subjective judgment | Quantitative (assertion auto-grading) + Qualitative (browser viewer) |
| Specialized agents | None | Grader / Comparator / Analyzer (3 types) [2] |
| Trigger optimization | None | Description Optimization (train/test split, max 5 iterations) |
| Script count | 3 | 8 (aggregate_benchmark.py, run_loop.py, run_eval.py, etc.) |
| Multi-environment | Claude Code only | Explicit branching for Claude Code / Claude.ai / Cowork |

The eval-viewer deserves special attention. The HTML viewer generated by eval-viewer/generate_review.py lets you review outputs per test case in a browser and enter feedback. It includes side-by-side comparison with previous iteration outputs, plus a benchmark tab for quantitative comparison of pass rates, token consumption, and execution time.
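To make the benchmark comparison concrete, here is a minimal sketch of how per-run results could be rolled up into pass rate, token consumption, and execution time for the with-skill and baseline configurations. The RunResult fields and the summarize function are hypothetical names chosen for illustration; they are not the schema or logic of aggregate_benchmark.py.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RunResult:
    """One test-case run. Field names are illustrative, not skill-creator's schema."""
    config: str      # "with_skill" or "baseline"
    passed: int      # assertions that passed
    total: int       # assertions checked
    tokens: int      # tokens consumed by the run
    seconds: float   # wall-clock execution time


def summarize(runs: list[RunResult]) -> dict[str, dict[str, float]]:
    """Aggregate pass rate, mean tokens, and mean time per configuration."""
    summary: dict[str, dict[str, float]] = {}
    for config in sorted({r.config for r in runs}):
        group = [r for r in runs if r.config == config]
        summary[config] = {
            # Pool assertions across runs so larger test cases weigh more.
            "pass_rate": sum(r.passed for r in group) / sum(r.total for r in group),
            "mean_tokens": mean(r.tokens for r in group),
            "mean_seconds": mean(r.seconds for r in group),
        }
    return summary


# Made-up numbers, purely to exercise the function.
runs = [
    RunResult("with_skill", passed=4, total=4, tokens=12000, seconds=40.0),
    RunResult("with_skill", passed=3, total=4, tokens=10000, seconds=36.0),
    RunResult("baseline", passed=2, total=4, tokens=8000, seconds=30.0),
    RunResult("baseline", passed=1, total=4, tokens=9000, seconds=34.0),
]
summary = summarize(runs)
```

Reading the two rows side by side is the point: a Skill that raises the pass rate but also raises tokens and time is a trade-off, not a free win.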


Why It Changed: The Unverified Context Problem

This evolution has an industry-wide backdrop.

Research from ETH Zurich (published February 2026) [3] reported that developer-written context files improved agent task completion rates by only 4%. Context auto-generated by an LLM actually worsened performance by 3%, and both kinds increased costs by over 20%.

The lesson isn't "don't write context." It's that "unverified context doesn't help." Skills face the same structural problem. Writing a SKILL.md that "seems to work" doesn't tell you whether the Skill is actually adding value.

Anthropic hasn't officially stated its motivation for embedding an eval pipeline into skill-creator. However, as Skills multiply and the ecosystem expands, cases where using a Skill makes results worse are bound to occur, and without quality measurement tools they go unnoticed. This concern aligns with the challenges the research above identified.


Reading the New Version's Design Philosophy

The SKILL.md text includes not just eval procedures but detailed thinking about Skill improvement. A design philosophy absent from the previous skill-creator emerges here.

The "Explain WHY" Principle

The new version explicitly discourages littering Skills with uppercase ALWAYS and NEVER directives. Instead, it recommends explaining why each instruction matters. When the reasoning is clear, the LLM can make appropriate judgments beyond the literal instruction. Rigid rule lists break down in unexpected cases.

The "Generalize, Don't Overfit" Principle

Test cases use only a few examples, but a Skill may be executed millions of times. Over-optimizing for test results produces Skills that only work for specific cases. The new SKILL.md prompts you to always ask: "Can this feedback be generalized?"

Overfitting Prevention in Description Optimization

The trigger accuracy optimization loop applies the same idea. It splits the eval set into train (60%) and test (40%) and selects the best description by test score, so a description that merely fits the training queries does not win. The train/test split concept from machine learning has been directly imported into Skill development.
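The loop can be sketched as follows, assuming a 60/40 split, at most five iterations, and selection by test score as described above. Everything else is an assumption for illustration: the function names, the choice to feed only train failures to the rewrite step, and the toy keyword_trigger and add_missed_keyword stand-ins that replace real model calls. This is not the code in run_loop.py or improve_description.py.

```python
import random
from typing import Callable

Query = tuple[str, bool]  # (user query, whether the Skill should trigger)


def split_evals(evals: list[Query], train_ratio: float = 0.6,
                seed: int = 0) -> tuple[list[Query], list[Query]]:
    """Shuffle deterministically, then cut into train and test sets."""
    shuffled = list(evals)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


def accuracy(triggers: Callable[[str, str], bool], description: str,
             evals: list[Query]) -> float:
    """Fraction of queries where the trigger decision matches the label."""
    return sum(triggers(description, q) == want for q, want in evals) / len(evals)


def optimize(description: str, evals: list[Query],
             triggers: Callable[[str, str], bool],
             improve: Callable[[str, list[Query]], str],
             max_iterations: int = 5) -> tuple[str, float]:
    """Revise using train failures only; keep the candidate with the best test score."""
    train, test = split_evals(evals)
    best, best_score = description, accuracy(triggers, description, test)
    current = description
    for _ in range(max_iterations):
        failures = [(q, want) for q, want in train if triggers(current, q) != want]
        if not failures:
            break
        current = improve(current, failures)  # never sees the test set
        score = accuracy(triggers, current, test)
        if score > best_score:
            best, best_score = current, score
    return best, best_score


# Toy stand-ins so the sketch runs without calling a model.
def keyword_trigger(description: str, query: str) -> bool:
    words = set(description.lower().split())
    return any(w in words for w in query.lower().split())


def add_missed_keyword(description: str, failures: list[Query]) -> str:
    for query, want in failures:
        if want:  # a query that should have triggered but did not
            return f"{description} {query.split()[-1]}"
    return description


evals = [
    ("convert this pdf", True), ("merge two pdf files", True),
    ("fill in a form", True), ("extract tables from a report", True),
    ("rotate pages in a document", True), ("what is the weather", False),
    ("write a poem", False), ("fix my python bug", False),
    ("translate this sentence", False), ("plan a trip", False),
]
train, test = split_evals(evals)
baseline = accuracy(keyword_trigger, "pdf", test)
best, best_score = optimize("pdf", evals, keyword_trigger, add_missed_keyword)
```

Because candidates are scored on queries the rewrite step never saw, a description that only patches the specific training failures gains nothing on the held-out set.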

These aren't just technical procedures. They declare a philosophy of treating Skills as software.


Practical Impact: Who Benefits and How

The new version's benefits vary by environment.

Claude Code benefits the most. Subagent parallel testing, browser viewer, and Description Optimization are all available. For teams developing and distributing Skills, the eval system becomes a quality gate.

Claude.ai has constraints. Without subagents, tests run sequentially, and baseline comparison and blind A/B comparison are skipped. However, test case definition and qualitative review flows are available, enabling a more structured improvement loop than manual testing.

Cowork supports subagents but lacks browser display. eval-viewer handles this with static HTML output (the --static option).

| Environment | Parallel testing | Browser viewer | Description optimization | Blind comparison |
| --- | --- | --- | --- | --- |
| Claude Code | Supported | Supported | Supported | Supported |
| Claude.ai | Not supported | Not supported | Not supported | Not supported |
| Cowork | Supported | Static HTML | Supported | Supported |
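One way to picture this branching is as a capability table that decides which eval steps run. The sketch below encodes the support matrix as data; the Capabilities class, the environment keys, and the step labels are all hypothetical, and the actual branching in skill-creator is written as instructions in SKILL.md rather than as code like this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Capabilities:
    """Per-environment support flags, mirroring the support matrix above."""
    parallel_testing: bool
    viewer: str  # "browser", "static", or "none"
    description_optimization: bool
    blind_comparison: bool


# Keys are illustrative labels, not identifiers used by skill-creator.
ENVIRONMENTS = {
    "claude-code": Capabilities(True, "browser", True, True),
    "claude-ai": Capabilities(False, "none", False, False),
    "cowork": Capabilities(True, "static", True, True),
}


def plan_eval(environment: str) -> list[str]:
    """Return the eval steps available in the given environment."""
    caps = ENVIRONMENTS[environment]
    steps = ["define test cases"]
    steps.append("run tests in parallel" if caps.parallel_testing
                 else "run tests sequentially")
    if caps.blind_comparison:
        steps.append("blind A/B comparison")
    if caps.viewer == "browser":
        steps.append("review in browser viewer")
    elif caps.viewer == "static":
        steps.append("review static HTML")
    else:
        steps.append("qualitative review without viewer")
    if caps.description_optimization:
        steps.append("optimize description")
    return steps
```

For example, plan_eval("claude-ai") yields only test case definition, sequential runs, and a viewer-less qualitative review, which matches the reduced loop described for that environment.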

The new version's direction is important, but the implementation still has rough edges. For example, a GitHub Issue reports that run_eval.py fails to trigger Skills via claude -p, resulting in a 0% trigger rate [4]. On the other hand, the problem where the Description Optimizer required ANTHROPIC_API_KEY has been resolved by migrating to claude -p subprocesses [5], which shows that the surrounding tools are being actively fixed.

Updating existing Skills is also supported. When you pass an existing Skill to skill-creator, it takes a snapshot and runs comparison tests using the old version as a baseline. You can quantitatively verify: "Did this get better than the previous version?"
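The question "did this get better?" is easier to answer per test case than from a single aggregate number. As a minimal sketch, assuming each version's results are reduced to a pass/fail flag per test case (a simplification; compare_versions and the case names are hypothetical), the comparison against the snapshot could look like this:

```python
def compare_versions(old: dict[str, bool], new: dict[str, bool]) -> dict[str, list[str]]:
    """Classify each shared test case by how its pass/fail status changed."""
    report: dict[str, list[str]] = {"fixed": [], "regressed": [], "unchanged": []}
    for case in sorted(old.keys() & new.keys()):
        if new[case] and not old[case]:
            report["fixed"].append(case)
        elif old[case] and not new[case]:
            report["regressed"].append(case)
        else:
            report["unchanged"].append(case)
    return report


# Made-up results: snapshot of the old Skill vs. the edited Skill.
old_results = {"invoice": True, "receipt": False, "contract": True}
new_results = {"invoice": True, "receipt": True, "contract": False}
report = compare_versions(old_results, new_results)
```

Here the overall pass rate is unchanged at two of three, yet one case was fixed and another regressed, which is exactly what an aggregate score alone would hide.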


Summary

The new skill-creator brings measurement and improvement loops to Skill development. It covers not just creation but also testing, measuring, and optimizing.

The basic structure introduced in the previous article (SKILL.md, directory layout, Progressive Disclosure, design principles) remains unchanged. What the new version added is a quality assurance layer on top.

As Skills transition from "write-and-forget" prompts to testable software artifacts, this shift may ripple across AI agent development practices as a whole.


References:


  1. The old version's structure and each script's role are explained in the previous article.

  2. agents/grader.md (assertion grading), agents/comparator.md (blind A/B comparison), and agents/analyzer.md (statistical pattern analysis) are bundled in the Skill directory.

  3. arXiv:2602.11988: ETH Zurich empirical study on the effectiveness of context files.

  4. Issue #556: run_eval.py 0% trigger rate issue (open as of March 7, 2026).

  5. Commit b0cbd3d: improve_description.py migrated from Anthropic SDK to claude -p subprocess (PR #547).