Evaluating Harnesses
A harness that validates structurally may still fail in practice if descriptions are vague, HARNESS.md is too long, or skills overlap in confusing ways. This guide covers how to evaluate harness quality before deploying it to users.
Structural validation
Use the harnesses-ref CLI to check that your harness is structurally valid before testing behavior:
- HARNESS.md exists and has required frontmatter (name, description)
- All items in skills/ are valid Agent Skills (contain SKILL.md with required frontmatter)
- Markdown files in references/ are recommended to have a description in their frontmatter
- Skill grouping subdirectories missing a SKILLS.md summary file
- Reference grouping subdirectories missing a REFERENCES.md summary file
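To make the first two checks concrete, here is a minimal Python sketch of what such a check could look like. It is not the harnesses-ref implementation: the function names, the naive frontmatter parser, and the assumption that every directory directly under skills/ is a skill (no grouping subdirectories) are all illustrative.

```python
from pathlib import Path

REQUIRED_KEYS = ("name", "description")


def frontmatter_keys(text: str) -> set[str]:
    """Naively collect top-level keys from a leading '---' frontmatter block."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return set()
    keys: set[str] = set()
    for line in lines[1:]:
        if line.strip() == "---":
            return keys
        # Treat unindented "key: value" lines as top-level keys.
        if ":" in line and not line.startswith((" ", "\t", "#")):
            keys.add(line.split(":", 1)[0].strip())
    return set()  # no closing '---', so treat the frontmatter as absent


def check_harness(root: Path) -> list[str]:
    """Return a list of problems; an empty list means none were found."""
    problems: list[str] = []

    harness_md = root / "HARNESS.md"
    if not harness_md.is_file():
        problems.append("HARNESS.md is missing")
    else:
        keys = frontmatter_keys(harness_md.read_text(encoding="utf-8"))
        for key in REQUIRED_KEYS:
            if key not in keys:
                problems.append(f"HARNESS.md frontmatter is missing: {key}")

    # Assumption: every directory directly under skills/ is a skill.
    skills_dir = root / "skills"
    if skills_dir.is_dir():
        for item in sorted(skills_dir.iterdir()):
            if item.is_dir() and not (item / "SKILL.md").is_file():
                problems.append(f"skills/{item.name} has no SKILL.md")

    return problems
```

Calling `check_harness(Path("my-harness"))` (a hypothetical directory) returns the problems as plain strings, which keeps the sketch easy to wire into a script or a test.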
Behavioral testing
Structural validity doesn’t guarantee the agent will use the harness correctly. Test behavior by running the agent with representative prompts and checking which skills it activates.
Build a test prompt set
Write 10–20 prompts that represent real tasks the agent will receive. For each prompt, note which skill(s) you expect the agent to activate.

| Prompt | Expected skill |
|---|---|
| “Summarize this article for me” | summarize |
| “Write a blog post about our new feature” | create-blog-post |
| “What’s our brand voice?” | (reference lookup: brand-priorities.md) |
| “Generate a product image” | generate-images |
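One way to keep the table machine-readable is to store it as data. The sketch below is only an illustration: `PromptCase`, `PROMPT_SET`, and `review_prompt_set` are hypothetical names, and it assumes that a reference lookup means no skill is expected to activate.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptCase:
    prompt: str
    expected_skills: frozenset[str]  # empty when no skill should activate
    note: str = ""


PROMPT_SET = [
    PromptCase("Summarize this article for me", frozenset({"summarize"})),
    PromptCase("Write a blog post about our new feature", frozenset({"create-blog-post"})),
    # Assumption: a reference lookup activates no skill.
    PromptCase("What's our brand voice?", frozenset(), note="reference lookup: brand-priorities.md"),
    PromptCase("Generate a product image", frozenset({"generate-images"})),
]


def review_prompt_set(cases: list[PromptCase]) -> list[str]:
    """Warn about a set that is outside 10-20 prompts or contains duplicates."""
    warnings = []
    if not 10 <= len(cases) <= 20:
        warnings.append(f"{len(cases)} prompts; the guide suggests 10-20")
    counts = Counter(case.prompt for case in cases)
    for prompt, n in counts.items():
        if n > 1:
            warnings.append(f"duplicate prompt: {prompt!r}")
    return warnings
```

With only the four example rows, `review_prompt_set(PROMPT_SET)` reports that the set is smaller than the suggested size.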
Check for activation errors
Run each prompt and observe what the agent activates:
- Wrong skill activated: the description of the correct skill doesn’t match the user’s vocabulary; revise the description
- No skill activated: the task wasn’t covered by any skill, or the skill description was too narrow; expand it or add a new skill
- Multiple skills activated unnecessarily: skill descriptions overlap; tighten the scope of each
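The three error types above can be told apart mechanically once you have the expected and observed skill sets for a prompt. This is a rough sketch; `classify_activation` and `SUGGESTED_FIX` are hypothetical names, and how you capture the activated skills depends on your agent runtime, which is not shown here.

```python
SUGGESTED_FIX = {
    "wrong skill activated": "revise the description of the correct skill",
    "no skill activated": "expand the description or add a new skill",
    "multiple skills activated unnecessarily": "tighten the scope of each skill",
}


def classify_activation(expected: set[str], activated: set[str]) -> str:
    """Label one test run by comparing expected and observed skill sets."""
    if activated == expected:
        return "ok"
    if not activated:
        return "no skill activated"
    # Every expected skill fired, plus at least one extra.
    if expected and expected < activated:
        return "multiple skills activated unnecessarily"
    # Anything else, including partial matches, is treated as a misroute.
    return "wrong skill activated"
```

For example, `classify_activation({"summarize"}, {"create-blog-post"})` returns `"wrong skill activated"`, and `SUGGESTED_FIX` maps that label back to the remedy from the list.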
Check HARNESS.md length
If the agent is slow to respond or loses context mid-task, HARNESS.md may be too long. Count the approximate tokens in the body. As a rough guide:
- Under 300 tokens — ideal
- 300–600 tokens — acceptable for complex harnesses
- Over 600 tokens — consider reorganizing with subdirectories and shortening descriptions
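A quick estimate is usually enough to place the body in one of these bands. The sketch below assumes roughly four characters per token, a common rule of thumb for English text rather than an exact count for any particular tokenizer; the function names are hypothetical, and the frontmatter stripping is deliberately naive.

```python
def body_of(text: str) -> str:
    """Drop a leading '---' frontmatter block, if present (naive split)."""
    if text.startswith("---"):
        parts = text.split("---", 2)  # ['', frontmatter, body]
        if len(parts) == 3:
            return parts[2]
    return text


def estimate_tokens(text: str) -> int:
    # Assumption: about 4 characters per token for English prose.
    return len(text) // 4


def length_band(tokens: int) -> str:
    """Map a token estimate onto the rough guide above."""
    if tokens < 300:
        return "ideal"
    if tokens <= 600:
        return "acceptable for complex harnesses"
    return "consider reorganizing"
```

Reading HARNESS.md and calling `length_band(estimate_tokens(body_of(text)))` gives a first signal; if the result sits near a boundary, count with the tokenizer your model actually uses.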
Regression testing
Whenever you update a skill, add a new one, or edit HARNESS.md, re-run your full test prompt set. Changes to one skill’s description can cause the agent to misroute prompts that previously worked.
Keep your test prompt set in version control alongside the harness.
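If the prompt set lives in a JSON file next to the harness, spotting regressions comes down to comparing two runs against the same expectations. The following is a sketch under that assumption: the JSON shape (`{"prompt": ["skill", ...]}`) and both function names are hypothetical, and collecting the per-run activations is left to your agent tooling.

```python
import json


def load_expectations(json_text: str) -> dict[str, set[str]]:
    """Parse {"prompt": ["skill", ...]} into a prompt -> set-of-skills mapping."""
    return {prompt: set(skills) for prompt, skills in json.loads(json_text).items()}


def find_regressions(
    expected: dict[str, set[str]],
    before: dict[str, set[str]],
    after: dict[str, set[str]],
) -> list[str]:
    """Prompts that matched expectations before the change but no longer do."""
    return [
        prompt
        for prompt, skills in expected.items()
        if before.get(prompt) == skills and after.get(prompt) != skills
    ]
```

Here `before` and `after` map each prompt to the skills activated in the run preceding and following the edit. A non-empty result lists the prompts that the change misrouted.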