Your coding agent already knows how to test your AI agent (we just turned it into a Skill)

**MyrinNew** · 03-23-2026, 02:46 PM

We’re adding something new at LangWatch: Skills.

And the idea is pretty simple:

your coding agent already knows how to do a lot of the work you’re still doing manually

You just haven’t packaged it properly yet.

The frustrating part of building AI agents

If you’ve built an LLM agent recently, you probably recognize this loop:

you tweak something
you run a few test conversations
it seems better
you ship it
something breaks in production

Then you repeat.

We’ve been there too.

It’s not that you don’t know you need evals, testing, or simulations.

It’s that doing all of that properly is… a lot.

The real work isn’t building, it’s validating

When we started LangWatch, we thought the main challenge was:

getting agents to behave correctly

But in practice, the bigger challenge was:

proving that they behave correctly

That means:

setting up eval datasets
writing tests
simulating real user behavior
instrumenting pipelines
understanding failures

And most of this ends up being:

manual
repetitive
easy to skip

The worst part: testing agents doesn’t look like testing code

Traditional testing breaks down with LLMs.

You can’t just say:

assert output == expected

Because agents are non-deterministic. The same input can give different outputs, which makes rigid testing fragile ([LangWatch][1]).
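To make that concrete, here is a minimal sketch of why exact-match assertions are fragile and what a criteria-based check looks like instead. Everything in it is hypothetical: `fake_agent` is a stand-in that rotates through phrasings (so the example is reproducible), and `meets_criteria` is just one way you might express "the answer is right" without pinning the wording.

```python
TEMPLATES = [
    "Your order ships in 3 days.",
    "It will ship within 3 days.",
    "Shipping takes 3 days from today.",
]


def fake_agent(question: str, run: int) -> str:
    """Hypothetical stand-in for an LLM agent.

    It rotates through phrasings by run number so the example is
    reproducible; a real agent would vary unpredictably.
    """
    return TEMPLATES[run % len(TEMPLATES)]


def meets_criteria(answer: str) -> bool:
    """Check properties of the answer instead of its exact wording."""
    text = answer.lower()
    return "ship" in text and "3 days" in text


def pass_rate(check, runs: int = 20) -> float:
    """Run the same question many times and report the share of passes."""
    question = "When does my order ship?"
    passed = sum(check(fake_agent(question, run)) for run in range(runs))
    return passed / runs


exact_rate = pass_rate(lambda answer: answer == "Your order ships in 3 days.")
criteria_rate = pass_rate(meets_criteria)
print(f"exact match: {exact_rate:.0%}, criteria: {criteria_rate:.0%}")
```

Every answer here is correct, yet the exact-match check passes only when the wording happens to line up, while the criteria check passes every run.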

So what do people do instead?

They “vibe test”.

try a few examples
eyeball the results
hope nothing breaks

It doesn’t scale.

We already solved part of this (with agents testing agents)

If you’ve seen our earlier work (Scenario), you know we took a different approach:

use an agent to test your agent

Instead of fixed inputs/outputs, you:

simulate real user behavior
define success criteria
let an agent explore and evaluate

This makes testing much closer to reality.
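As a rough illustration of that loop (this is not the Scenario API, just plain Python with made-up names), the sketch below wires a scripted user simulator, a stub agent, and a judge that checks success criteria over the whole conversation. In a real setup the simulator and the judge would typically be LLM-backed; here they are hard-coded so the example runs on its own.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Transcript:
    turns: List[Tuple[str, str]] = field(default_factory=list)  # (role, text)

    def add(self, role: str, text: str) -> None:
        self.turns.append((role, text))


def scripted_user(transcript: Transcript) -> Optional[str]:
    """Hypothetical user simulator; returns None when the user is done."""
    script = ["I want a refund for order 1234.", "It arrived broken.", None]
    user_turns = sum(1 for role, _ in transcript.turns if role == "user")
    return script[user_turns]


def agent_under_test(transcript: Transcript) -> str:
    """Hypothetical stub standing in for your real agent."""
    last_message = transcript.turns[-1][1].lower()
    if "refund" in last_message:
        return "Sorry to hear that. What went wrong with the order?"
    return "Thanks, I have issued a refund for order 1234."


def run_simulation(max_turns: int = 5) -> Transcript:
    """Alternate user and agent turns until the user stops or the cap is hit."""
    transcript = Transcript()
    for _ in range(max_turns):
        message = scripted_user(transcript)
        if message is None:
            break
        transcript.add("user", message)
        transcript.add("agent", agent_under_test(transcript))
    return transcript


def judge(transcript: Transcript,
          criteria: Dict[str, Callable[[Transcript], bool]]) -> Dict[str, bool]:
    """Evaluate each named success criterion against the full conversation."""
    return {name: check(transcript) for name, check in criteria.items()}


def agent_texts(transcript: Transcript) -> List[str]:
    return [text for role, text in transcript.turns if role == "agent"]


criteria = {
    "asks_what_went_wrong": lambda t: any("?" in text for text in agent_texts(t)),
    "refund_issued": lambda t: "issued a refund" in agent_texts(t)[-1],
}

transcript = run_simulation()
verdict = judge(transcript, criteria)
print(verdict)
```

The point of the shape is that the criteria describe outcomes of the conversation, not specific strings the agent must emit at a specific turn.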

But even then…

You still had to set everything up yourself.

So we asked: why are we still doing this manually?

At this point, most developers already have a coding agent open all day.

And those agents are actually pretty good at:

writing tests
structuring code
following instructions

So we started asking:

what if we let the coding agent handle the “quality work” too?

Not just writing features.

But:

setting up evals
creating simulations
instrumenting systems
analyzing behavior

That’s where Skills come in

We built LangWatch Skills as a way to give your coding agent reusable capabilities.

A Skill is basically:

a structured way to get your coding agent to do something correctly, every time

Not just:

“generate some code”

But:

“do this properly, following best practices, with full coverage”

What a Skill actually looks like

Under the hood, a Skill is a bundle of:

structured instructions
workflows
examples
best practices

In general, agent skills are “instruction modules” that extend what an agent can do without retraining it ([philschmid.de][2]).

They tell the agent:

when to apply something
how to do it
what good looks like
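One way to picture those three parts is as a small data structure that gets rendered into instructions for the agent. This is only a sketch under our own assumptions: the `Skill` class, its field names, and the example content are hypothetical and do not describe the actual LangWatch Skills format.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Skill:
    """Hypothetical model of a skill: when to apply, how, and what good looks like."""
    name: str
    when_to_apply: str
    steps: Tuple[str, ...]
    done_when: Tuple[str, ...]

    def to_instructions(self) -> str:
        """Render the skill as plain-text instructions an agent could follow."""
        lines = [
            f"Skill: {self.name}",
            f"Use when: {self.when_to_apply}",
            "Steps:",
        ]
        lines += [f"  {i}. {step}" for i, step in enumerate(self.steps, start=1)]
        lines.append("Done when:")
        lines += [f"  - {item}" for item in self.done_when]
        return "\n".join(lines)


simulation_skill = Skill(
    name="create-simulation-tests",
    when_to_apply="The user asks for tests of a conversational agent.",
    steps=(
        "Identify the agent's entry point.",
        "Write one simulated conversation per user goal.",
        "Define success criteria for each conversation.",
    ),
    done_when=(
        "Every user goal has at least one simulation.",
        "All simulations run from a single test command.",
    ),
)

print(simulation_skill.to_instructions())
```

Keeping "done when" as an explicit field is the part that tends to get skipped, and it is what lets you check whether the agent actually applied the skill well.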

What you can do with LangWatch Skills

With Skills, you can tell your coding agent to:

instrument your agent
generate evaluation notebooks
create simulation-based tests
explore production performance
red-team your system

And instead of figuring out how to do it…

…the agent just does it.

The shift is subtle, but important

Before:

you write eval code, tests, and infrastructure

After:

you review and guide what your agent generates

You move from implementation to coordination.

And that’s actually where most of the value is.

This is part of a bigger shift: “harness engineering”

There’s a growing idea in the ecosystem:

the performance of your agent depends heavily on how you configure it

Not just the model.

But:

tools
context
memory
skills

These are all part of what some people call the agent “harness” — the system around the model that shapes its behavior ([humanlayer.dev][3]).

Skills are one of the most powerful (and underused) pieces of that.

But Skills aren’t magic

One important thing we’ve learned:

Skills don’t automatically fix everything.

In fact, a lot of skills:

don’t improve performance
or only help in specific contexts

Recent research shows many skills have limited impact unless they’re well-designed and properly evaluated ([arXiv][4]).

So the goal isn’t:

“add more skills”

It’s:

“add the right skills, and make them actually useful”
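A simple way to check whether a skill earns its place is to run the same tasks with and without it and compare pass rates per task. The sketch below assumes you already have pass/fail results from such runs; the task names and the results are made up purely for illustration and are not measurements.

```python
from typing import Dict, List


def pass_rate(results: List[bool]) -> float:
    """Share of runs that passed."""
    return sum(results) / len(results)


def skill_impact(baseline: Dict[str, List[bool]],
                 with_skill: Dict[str, List[bool]]) -> Dict[str, float]:
    """Per-task change in pass rate when the skill is enabled."""
    return {
        task: pass_rate(with_skill[task]) - pass_rate(baseline[task])
        for task in baseline
    }


# Made-up outcomes for illustration only: four runs per task.
baseline_runs = {
    "add_tracing": [True, False, False, True],
    "write_eval": [True, True, False, True],
    "red_team": [True, True, True, False],
}
skill_runs = {
    "add_tracing": [True, True, True, True],
    "write_eval": [True, True, False, True],
    "red_team": [True, False, True, False],
}

impact = skill_impact(baseline_runs, skill_runs)
for task, delta in impact.items():
    print(f"{task}: {delta:+.2f}")
```

In this toy data the skill helps one task, does nothing for another, and hurts a third, which is the pattern to watch for before adding a skill everywhere.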

Why this matters now

We’re entering a phase where:

building agents is easy
making them reliable is not

The bottleneck has shifted.

And the teams that win won’t just be the ones who:

build faster

But the ones who:

validate better
iterate faster with confidence

What we’re aiming for

With Skills, the goal is simple:

reduce the amount of manual work required to build reliable AI systems

So instead of:

wiring pipelines
writing eval scaffolding
guessing what broke

You can:

delegate
review
improve

Would love feedback

This is a new direction for us, and we’re still figuring out:

What makes a “good” Skill?
Where do Skills break down?
What should be automated vs controlled?

If you’re working on LLM agents, I’d love to hear:

how you’re handling evals today
what’s still painful
what you’ve tried that didn’t work

Try it out

If this resonates, you can check out what we’re building here:

👉 LangWatch Skills

Final thought

Your coding agent is already capable of doing much more than we typically ask of it.

Skills are just a way to unlock that.

The interesting question now is:

what else are we still doing manually that agents could handle better?
