
Evaluating Harnesses

A harness that validates structurally may still fail in practice if descriptions are vague, HARNESS.md is too long, or skills overlap in confusing ways. This guide covers how to evaluate harness quality before deploying it to users.

Structural validation

Use the harnesses-ref CLI to check that your harness is structurally valid before testing behavior:
harnesses-ref validate ./my-harness
This checks:
  • HARNESS.md exists and has required frontmatter (name, description)
  • All items in skills/ are valid Agent Skills (contain SKILL.md with required frontmatter)
  • Markdown files in references/ have a description in their frontmatter (recommended rather than required)
It also emits warnings (non-fatal) for:
  • Skill grouping subdirectories missing a SKILLS.md summary file
  • Reference grouping subdirectories missing a REFERENCES.md summary file
Fix any structural errors before proceeding to behavioral testing. Warnings do not block validation but are worth addressing for larger harnesses where routing quality matters.
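To make the required checks concrete, here is a minimal Python sketch that mirrors them. It is not how harnesses-ref is implemented. It assumes frontmatter is a leading block delimited by `---` lines, that SKILL.md requires the same `name` and `description` keys as HARNESS.md, and that skills/ is flat (no grouping subdirectories). The names `frontmatter_keys` and `check_harness` are hypothetical.

```python
from pathlib import Path

# Assumption: SKILL.md requires the same keys as HARNESS.md.
REQUIRED_KEYS = {"name", "description"}


def frontmatter_keys(md_path: Path) -> set[str]:
    """Return top-level keys from a leading '---' delimited frontmatter block."""
    lines = md_path.read_text(encoding="utf-8").splitlines()
    if not lines or lines[0].strip() != "---":
        return set()
    keys: set[str] = set()
    for line in lines[1:]:
        if line.strip() == "---":
            return keys
        # Count only unindented "key: value" lines; indented lines are nested values.
        if ":" in line and not line.startswith((" ", "\t", "#")):
            keys.add(line.split(":", 1)[0].strip())
    return set()  # the closing '---' was never found


def check_harness(root: Path) -> list[str]:
    """Collect structural errors for a flat harness layout rooted at `root`."""
    errors: list[str] = []
    targets = [root / "HARNESS.md"]
    skills_dir = root / "skills"
    if skills_dir.is_dir():
        # Assumption: every directory directly under skills/ is a skill.
        targets += [d / "SKILL.md" for d in sorted(skills_dir.iterdir()) if d.is_dir()]
    for md in targets:
        label = md.relative_to(root).as_posix()
        if not md.is_file():
            errors.append(f"{label}: missing")
            continue
        missing = REQUIRED_KEYS - frontmatter_keys(md)
        if missing:
            errors.append(f"{label}: frontmatter lacks {sorted(missing)}")
    return errors
```

An empty list from `check_harness(Path("./my-harness"))` means the sketch found no errors under these assumptions. The CLI remains the authority on validity.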

Behavioral testing

Structural validity doesn’t guarantee the agent will use the harness correctly. Test behavior by running the agent with representative prompts and checking which skills it activates.

Build a test prompt set

Write 10–20 prompts that represent real tasks the agent will receive. For each prompt, note which skill(s) you expect the agent to activate.
| Prompt | Expected skill |
| --- | --- |
| “Summarize this article for me” | summarize |
| “Write a blog post about our new feature” | create-blog-post |
| “What’s our brand voice?” | (reference lookup: brand-priorities.md) |
| “Generate a product image” | generate-images |
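One possible way to keep such a prompt set machine-readable is sketched below in Python. The `PromptCase` type, its field names, and the JSON layout are assumptions made for illustration and are not part of any harness format. An empty `expected_skills` tuple marks a prompt that should be answered from a reference rather than a skill.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class PromptCase:
    """One test prompt and the skills it is expected to activate (hypothetical type)."""
    prompt: str
    expected_skills: tuple[str, ...]  # empty when no skill should activate
    note: str = ""


CASES = [
    PromptCase("Summarize this article for me", ("summarize",)),
    PromptCase("Write a blog post about our new feature", ("create-blog-post",)),
    PromptCase("What's our brand voice?", (), note="reference lookup: brand-priorities.md"),
    PromptCase("Generate a product image", ("generate-images",)),
]


def dump_cases(cases: list[PromptCase]) -> str:
    """Serialize cases to JSON text suitable for committing to version control."""
    return json.dumps([asdict(case) for case in cases], indent=2)


def load_cases(text: str) -> list[PromptCase]:
    """Parse JSON text produced by dump_cases back into PromptCase objects."""
    return [
        PromptCase(row["prompt"], tuple(row["expected_skills"]), row.get("note", ""))
        for row in json.loads(text)
    ]
```

Storing the set as plain JSON keeps diffs readable when prompts are added or expectations change.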

Check for activation errors

Run each prompt and observe what the agent activates:
  • Wrong skill activated: the description of the correct skill doesn’t match the user’s vocabulary; revise that description
  • No skill activated: the task wasn’t covered by any skill, or the relevant skill’s description was too narrow; broaden the description or add a new skill
  • Multiple skills activated unnecessarily: skill descriptions overlap; tighten the scope of each
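The three error types above can be told apart mechanically once you have recorded which skills the agent activated for a prompt. The following is a minimal sketch, assuming you collect the activated skill names yourself (for example from agent logs). The function name `classify_activation` is hypothetical.

```python
def classify_activation(expected: set[str], activated: set[str]) -> str:
    """Label one test run by comparing expected and activated skill names."""
    if activated == expected:
        return "ok"
    if not activated:
        # Something was expected, but nothing fired.
        return "no skill activated"
    if expected and expected <= activated:
        # Every expected skill fired, plus at least one extra.
        return "multiple skills activated unnecessarily"
    # An expected skill is absent, or a skill fired when none was expected.
    return "wrong skill activated"
```

For example, `classify_activation({"summarize"}, {"summarize", "create-blog-post"})` returns `"multiple skills activated unnecessarily"`, which points at overlapping descriptions.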

Check HARNESS.md length

If the agent is slow to respond or loses context mid-task, HARNESS.md may be too long. Count the approximate tokens in the body. As a rough guide:
  • Under 300 tokens — ideal
  • 300–600 tokens — acceptable for complex harnesses
  • Over 600 tokens — consider reorganizing with subdirectories and shortening descriptions
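A rough count is enough to place HARNESS.md in one of these bands. The sketch below assumes about four characters per token, a common rule of thumb for English text rather than an exact tokenizer, and assumes frontmatter is a leading block delimited by `---` lines. All function names are hypothetical.

```python
def body_of(markdown: str) -> str:
    """Return the text after a leading '---' frontmatter block, if one exists."""
    lines = markdown.splitlines()
    if lines and lines[0].strip() == "---":
        for index, line in enumerate(lines[1:], start=1):
            if line.strip() == "---":
                return "\n".join(lines[index + 1:])
    return markdown  # no frontmatter, or it was never closed


def estimate_tokens(text: str) -> int:
    """Approximate the token count as characters / 4, rounded up (heuristic)."""
    return (len(text) + 3) // 4


def rate_length(tokens: int) -> str:
    """Map a token count onto the bands from the guide."""
    if tokens < 300:
        return "ideal"
    if tokens <= 600:
        return "acceptable for complex harnesses"
    return "consider reorganizing"
```

To apply it, read the file and call `rate_length(estimate_tokens(body_of(text)))`. For a more faithful count, swap `estimate_tokens` for the tokenizer of the model you deploy with.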

Regression testing

Whenever you update a skill, add a new one, or edit HARNESS.md, re-run your full test prompt set. Changes to one skill’s description can cause the agent to misroute prompts that previously worked. Keep your test prompt set in version control alongside the harness.
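A regression check can be as simple as comparing pass/fail results from before and after a change. The sketch below assumes you record one boolean per prompt for each run. The result format and the name `find_regressions` are illustrative assumptions.

```python
def find_regressions(baseline: dict[str, bool], current: dict[str, bool]) -> list[str]:
    """Return prompts that passed in the baseline run but do not pass now.

    A prompt missing from `current` counts as a regression, so dropping
    a test case by accident is flagged rather than silently ignored.
    """
    return sorted(
        prompt
        for prompt, passed in baseline.items()
        if passed and not current.get(prompt, False)
    )
```

Prompts that already failed in the baseline are not reported, which keeps the output focused on what the latest change broke.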

Validation CLI reference

# Check structure
harnesses-ref validate ./my-harness

# Read a specific property from HARNESS.md
harnesses-ref read ./my-harness name
harnesses-ref read ./my-harness description

# Render the harness as prompt XML for inspection
harnesses-ref prompt ./my-harness