Evaluating Harnesses

A harness that passes structural validation may still fail in practice if its descriptions are vague, its HARNESS.md is too long, or its content overlaps in confusing ways. This guide covers how to evaluate a harness's quality before deploying it to users.

Structural validation

Use the harnesses-ref CLI to check that your harness is structurally valid before testing behavior:

harnesses-ref validate ./my-harness

This checks:

HARNESS.md exists and has required frontmatter (name, description)
All skill leaves (directories detected as skills via .leaf-detectors) have valid frontmatter

It also emits warnings (non-fatal) for:

Grouping subdirectories missing a routing file — routing files enable progressive disclosure but are not required
Markdown content files missing a description in their frontmatter — descriptions help agents decide whether to load a file

Fix any structural errors before proceeding to behavioral testing. Warnings do not block validation but are worth addressing for larger harnesses where routing quality matters.
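
If you script your checks, the ordering above (structure first, behavior second) can be enforced with a small gate. The following is a minimal sketch, assuming that `harnesses-ref validate` exits with a nonzero status when it finds structural errors. This guide does not state that exit-code behavior, so confirm it for your installed version. The function name `structure_is_valid` is hypothetical.

```python
import subprocess
from typing import Sequence


def structure_is_valid(
    harness_dir: str,
    command: Sequence[str] = ("harnesses-ref", "validate"),
) -> bool:
    """Run the validator and report whether it succeeded.

    Assumption: the validator signals structural errors with a nonzero
    exit status, and warnings alone leave the exit status at zero.
    """
    result = subprocess.run(
        [*command, harness_dir],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        # Show whatever the validator printed so the errors can be fixed.
        print(result.stdout + result.stderr)
    return result.returncode == 0
```

A test runner could call `structure_is_valid("./my-harness")` first and skip the behavioral prompts when it returns `False`.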

Behavioral testing

Structural validity doesn’t guarantee the agent will navigate the harness correctly. Test behavior by running the agent with representative prompts and checking which content it loads.

Build a test prompt set

Write 10–20 prompts that represent real tasks the agent will receive. For each prompt, note which skills or documents you expect the agent to load.

Prompt	Expected content
“Summarize this article for me”	`skills/summarize`
“Write a blog post about our new feature”	`skills/create-blog-post`
“What’s our brand voice?”	`brand/voice-guidelines.md`
“Generate a product image”	`skills/generate-images`
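
To make the prompt set reusable, it can be stored as data, not just as a table in a document. The sketch below is one possible representation, not a format that harnesses-ref defines: the `PromptCase` class and the `prompts_per_target` helper are hypothetical names. The helper counts how many prompts target each path, which makes it easy to spot content that no prompt exercises.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptCase:
    prompt: str                # what the user would type
    expected: frozenset[str]   # paths the agent is expected to load


PROMPT_SET = [
    PromptCase("Summarize this article for me",
               frozenset({"skills/summarize"})),
    PromptCase("Write a blog post about our new feature",
               frozenset({"skills/create-blog-post"})),
    PromptCase("What's our brand voice?",
               frozenset({"brand/voice-guidelines.md"})),
    PromptCase("Generate a product image",
               frozenset({"skills/generate-images"})),
]


def prompts_per_target(cases: list[PromptCase]) -> dict[str, int]:
    """Count how many prompts expect each skill or document."""
    counts: Counter[str] = Counter()
    for case in cases:
        counts.update(case.expected)
    return dict(counts)
```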

Check for routing errors

Run each prompt and observe what the agent loads:

Wrong content loaded: the description of the correct skill or document doesn’t match the user’s vocabulary. Revise the description to use the terms users actually use.
Nothing loaded: either the harness doesn’t cover the task, or the relevant description is too narrow. Add new content in the first case; broaden the description in the second.
Too much loaded: descriptions overlap. Tighten the scope of each so that only one matches.
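
The three outcomes above can be told apart mechanically once you know which paths the agent loaded for a prompt. The following is a minimal sketch. How you obtain the set of loaded paths depends on your agent client and is not covered here, and the function name and the label strings are hypothetical.

```python
def classify_routing(expected: set[str], loaded: set[str]) -> str:
    """Label one prompt's result by comparing expected and loaded paths."""
    if expected and not loaded:
        return "nothing_loaded"
    if not expected <= loaded:
        # At least one expected path is missing from what was loaded.
        return "wrong_content"
    if loaded - expected:
        # Everything expected was loaded, plus extra content.
        return "too_much_loaded"
    return "ok"
```

For example, `classify_routing({"skills/summarize"}, {"skills/summarize", "skills/create-blog-post"})` returns `"too_much_loaded"`, which points at overlapping descriptions.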

Check HARNESS.md length

If the agent is slow to respond or loses context mid-task, HARNESS.md may be too long. Estimate the token count of its body. As a rough guide:

Under 300 tokens: ideal
300–600 tokens: acceptable for complex harnesses
Over 600 tokens: consider moving content into subdirectories and reducing HARNESS.md to pointers to their routing files
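
A quick estimate is usually enough to place HARNESS.md in one of these bands. The sketch below rests on two assumptions that this guide does not state: that the frontmatter is delimited by `---` lines, and that one token is roughly four characters of English text. The second is a coarse heuristic, so use your model's tokenizer if you need an exact count. All function names are hypothetical.

```python
def strip_frontmatter(text: str) -> str:
    """Return the body, dropping a leading block delimited by '---' lines."""
    lines = text.splitlines()
    if lines and lines[0].strip() == "---":
        for i in range(1, len(lines)):
            if lines[i].strip() == "---":
                return "\n".join(lines[i + 1:])
    return text  # no frontmatter found; treat everything as body


def estimate_body_tokens(harness_md_text: str) -> int:
    """Rough token estimate: about four characters per token (assumption)."""
    return len(strip_frontmatter(harness_md_text)) // 4


def length_band(tokens: int) -> str:
    """Map a token count onto the bands listed in this guide."""
    if tokens < 300:
        return "ideal"
    if tokens <= 600:
        return "acceptable"
    return "reorganize"
```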

Regression testing

Whenever you update a skill, add new content, or edit HARNESS.md, re-run your full test prompt set. Changes to one description can cause the agent to misroute prompts that previously worked. Keep your test prompt set in version control alongside the harness.
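
Misrouted prompts that previously worked are easiest to catch by comparing the results of two runs. The following is a minimal sketch, assuming you record each run as a mapping from prompt text to a pass/fail flag. That recording format and the function name are hypothetical, not something harnesses-ref produces.

```python
def find_regressions(previous: dict[str, bool],
                     current: dict[str, bool]) -> list[str]:
    """Return prompts that passed in the previous run but not in the current one.

    A prompt missing from the current run is treated as a regression,
    so that accidentally dropped test cases are noticed.
    """
    return sorted(
        prompt
        for prompt, passed in previous.items()
        if passed and not current.get(prompt, False)
    )
```

An empty list means the change did not break any prompt that used to route correctly. New prompts that fail still need attention, but they are not regressions.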

Validation CLI reference

# Check structure
harnesses-ref validate ./my-harness

# Read a specific property from HARNESS.md
harnesses-ref read ./my-harness name
harnesses-ref read ./my-harness description

# Render the harness as prompt XML for inspection
harnesses-ref prompt ./my-harness
