Your coding agent already knows how to test your AI agent (we just turned it into a Skill)

  • MyrinNew
    Senior Member
    • Feb 2024
    • 5175

    #1





    We’re adding something new at LangWatch: Skills.


    And the idea is pretty simple:


    your coding agent already knows how to do a lot of the work you’re still doing manually


    You just haven’t packaged it properly yet.


    The frustrating part of building AI agents

    If you’ve built an LLM agent recently, you probably recognize this loop:
    • you tweak something
    • you run a few test conversations
    • it seems better
    • you ship it
    • something breaks in production


    Then you repeat.


    We’ve been there too.


    It’s not that you don’t know you need evals, testing, or simulations.


    It’s that doing all of that properly is… a lot.


    The real work isn’t building, it’s validating

    When we started LangWatch, we thought the main challenge was:


    getting agents to behave correctly


    But in practice, the bigger challenge was:


    proving that they behave correctly


    That means:
    • setting up eval datasets
    • writing tests
    • simulating real user behavior
    • instrumenting pipelines
    • understanding failures


    And most of this ends up being:
    • manual
    • repetitive
    • easy to skip





    The worst part: testing agents doesn’t look like testing code

    Traditional testing breaks down with LLMs.


    You can’t just say:






    assert output == expected







    That fails because agents are non-deterministic: the same input can produce different outputs, so exact-match assertions are fragile ([LangWatch][1]).
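To make the fragility concrete, here is a minimal sketch in plain Python. It does not use any LangWatch API. `fake_agent`, `exact_match` and `meets_criteria` are made-up names, and the "agent" is a stub that cycles through three phrasings of the same correct answer to imitate non-determinism.

```python
# Stand-in for a non-deterministic agent: same question, different wording per run.
TEMPLATES = [
    "Your refund will arrive in 5 business days.",
    "Expect the refund within 5 business days.",
    "Refunds take 5 business days to process.",
]

def fake_agent(question: str, run: int) -> str:
    """Hypothetical agent stub; a real agent would call an LLM here."""
    return TEMPLATES[run % len(TEMPLATES)]

def exact_match(output: str, expected: str) -> bool:
    """The rigid check: passes only on one exact string."""
    return output == expected

def meets_criteria(output: str) -> bool:
    """A looser check: does the answer contain the facts we care about?"""
    text = output.lower()
    return "refund" in text and "5 business days" in text

outputs = [fake_agent("When do I get my refund?", run) for run in range(20)]
expected = "Your refund will arrive in 5 business days."

exact_pass = sum(exact_match(o, expected) for o in outputs)
criteria_pass = sum(meets_criteria(o) for o in outputs)

print(f"exact match: {exact_pass}/20, criteria: {criteria_pass}/20")
```

Every answer here is correct, yet the exact-match check fails most runs while the criteria check accepts all of them. Keyword matching is only the simplest possible criterion; it is used here to keep the sketch self-contained.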


    So what do people do instead?


    They “vibe test”.
    • try a few examples
    • eyeball the results
    • hope nothing breaks


    It doesn’t scale.





    We already solved part of this (with agents testing agents)

    If you’ve seen our earlier work (Scenario), you know we took a different approach:


    use an agent to test your agent


    Instead of fixed inputs/outputs, you:
    • simulate real user behavior
    • define success criteria
    • let an agent explore and evaluate


    This makes testing much closer to reality.
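A rough sketch of that loop, assuming nothing about Scenario's real API: every name below (`Scenario`, `agent_under_test`, `run_scenario`, `agent_text`) is hypothetical. The simulated user is a fixed script and the judge is a list of plain predicates, where a real setup would typically use LLM-driven agents for both.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Turn = Tuple[str, str]  # (role, message)

@dataclass
class Scenario:
    name: str
    user_script: List[str]  # stand-in for a simulated user
    criteria: List[Tuple[str, Callable[[List[Turn]], bool]]]  # success criteria

def agent_under_test(message: str) -> str:
    """Hypothetical agent; swap in a call to your real agent."""
    if "refund" in message.lower():
        return "I can help with that. What is your order number?"
    if any(ch.isdigit() for ch in message):
        return "Thanks, refund issued for that order."
    return "Could you tell me more?"

def agent_text(transcript: List[Turn]) -> str:
    """All agent messages, lowercased, for simple criteria checks."""
    return " ".join(msg.lower() for role, msg in transcript if role == "agent")

def run_scenario(scenario: Scenario) -> Dict[str, object]:
    """Play the user script against the agent, then judge the whole transcript."""
    transcript: List[Turn] = []
    for user_message in scenario.user_script:
        transcript.append(("user", user_message))
        transcript.append(("agent", agent_under_test(user_message)))
    verdicts = {label: check(transcript) for label, check in scenario.criteria}
    return {"passed": all(verdicts.values()), "criteria": verdicts, "transcript": transcript}

scenario = Scenario(
    name="refund request",
    user_script=["I want a refund", "Order 1234"],
    criteria=[
        ("asks for order number", lambda t: "order number" in agent_text(t)),
        ("confirms the refund", lambda t: "refund issued" in agent_text(t)),
    ],
)

report = run_scenario(scenario)
print(report["passed"], report["criteria"])
```

The point of the shape is that criteria are judged against the whole conversation, not against one expected string per turn.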


    But even then…


    You still had to set everything up yourself.





    So we asked: why are we still doing this manually?

    At this point, most developers already have a coding agent open all day.


    And those agents are actually pretty good at:
    • writing tests
    • structuring code
    • following instructions


    So we started asking:


    what if we let the coding agent handle the “quality work” too?


    Not just writing features.


    But:
    • setting up evals
    • creating simulations
    • instrumenting systems
    • analyzing behavior





    That’s where Skills come in

    We built LangWatch Skills as a way to give your coding agent reusable capabilities.


    A Skill is basically:


    a structured way to get your coding agent to do something correctly, every time


    Not just:


    “generate some code”


    But:


    “do this properly, following best practices, with full coverage”





    What a Skill actually looks like

    Under the hood, a Skill is made up of:
    • structured instructions
    • workflows
    • examples
    • best practices


    In general, agent skills are “instruction modules” that extend what an agent can do without retraining it ([philschmid.de][2]).


    They tell the agent:
    • when to apply something
    • how to do it
    • what good looks like
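One way to picture those three parts is as a small data structure that renders into instructions for the coding agent. This is an illustrative sketch only, not the actual LangWatch Skill format: the `Skill` class, its field names and the example contents are all assumptions.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Skill:
    """Hypothetical container for the parts of a skill described above."""
    name: str
    when_to_apply: str             # when to apply something
    steps: List[str]               # how to do it
    definition_of_done: List[str]  # what good looks like
    examples: List[str] = field(default_factory=list)

    def render(self) -> str:
        """Flatten the skill into plain-text instructions for an agent."""
        lines = [
            f"Skill: {self.name}",
            f"When to apply: {self.when_to_apply}",
            "Steps:",
        ]
        lines += [f"  {i}. {step}" for i, step in enumerate(self.steps, 1)]
        lines.append("Done when:")
        lines += [f"  - {item}" for item in self.definition_of_done]
        if self.examples:
            lines.append("Examples:")
            lines += [f"  - {example}" for example in self.examples]
        return "\n".join(lines)

skill = Skill(
    name="simulation-tests",
    when_to_apply="The user asks for tests covering multi-turn agent behavior.",
    steps=[
        "Identify the agent entry point.",
        "Write one scenario per user goal.",
        "Define success criteria for each scenario.",
    ],
    definition_of_done=["Every scenario runs locally", "Each criterion is checkable"],
)

rendered = skill.render()
print(rendered)
```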





    What you can do with LangWatch Skills

    With Skills, you can tell your coding agent to:
    • instrument your agent
    • generate evaluation notebooks
    • create simulation-based tests
    • explore production performance
    • red-team your system


    And instead of figuring out how to do it…


    …the agent just does it.





    The shift is subtle, but important

    Before:


    you write eval code, tests, and infrastructure


    After:


    you review and guide what your agent generates


    You move from implementation to coordination.


    And that’s actually where most of the value is.





    This is part of a bigger shift: “harness engineering”

    There’s a growing idea in the ecosystem:


    the performance of your agent depends heavily on how you configure it


    Not just the model.


    But:
    • tools
    • context
    • memory
    • skills


    These are all part of what some people call the agent “harness” — the system around the model that shapes its behavior ([humanlayer.dev][3]).


    Skills are one of the most powerful (and underused) pieces of that.





    But Skills aren’t magic

    One important thing we’ve learned:


    Skills don’t automatically fix everything.


    In fact, a lot of skills:
    • don’t improve performance
    • or only help in specific contexts


    Recent research shows many skills have limited impact unless they’re well-designed and properly evaluated ([arXiv][4]).
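"Properly evaluated" can start as simply as running the same scenarios with and without the skill and comparing pass rates. The sketch below assumes you already have pass/fail results from such runs. The numbers are invented for illustration, and `pass_rate`, `MIN_UPLIFT` and `keep_skill` are hypothetical names.

```python
from typing import List

def pass_rate(results: List[bool]) -> float:
    """Fraction of scenario runs that passed."""
    return sum(results) / len(results)

# Invented results for the same 10 scenarios, run without and with the skill.
baseline = [True, False, True, False, True, True, False, True, False, True]
with_skill = [True, True, True, False, True, True, True, True, False, True]

MIN_UPLIFT = 0.1  # assumed threshold; pick one that fits your tolerance for noise

uplift = pass_rate(with_skill) - pass_rate(baseline)
keep_skill = uplift >= MIN_UPLIFT

print(f"baseline={pass_rate(baseline):.0%} with_skill={pass_rate(with_skill):.0%} keep={keep_skill}")
```

With only a handful of runs a difference like this can be noise, so in practice you would want more scenarios and repeated runs before trusting the comparison.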


    So the goal isn’t:


    “add more skills”


    It’s:


    “add the right skills, and make them actually useful”





    Why this matters now

    We’re entering a phase where:
    • building agents is easy
    • making them reliable is not


    The bottleneck has shifted.


    And the teams that win won’t just be the ones who build faster, but the ones who:
    • validate better
    • iterate faster with confidence





    What we’re aiming for

    With Skills, the goal is simple:


    reduce the amount of manual work required to build reliable AI systems


    So instead of:
    • wiring pipelines
    • writing eval scaffolding
    • guessing what broke


    You can:
    • delegate
    • review
    • improve





    Would love feedback

    This is a new direction for us, and we’re still figuring out:
    • What makes a “good” Skill?
    • Where do Skills break down?
    • What should be automated vs controlled?


    If you’re working on LLM agents, I’d love to hear:
    • how you’re handling evals today
    • what’s still painful
    • what you’ve tried that didn’t work





    Try it out

    If this resonates, you can check out what we’re building here:


    👉 LangWatch Skills





    Final thought

    Your coding agent is already capable of doing much more than we typically ask of it.


    Skills are just a way to unlock that.


    The interesting question now is:


    what else are we still doing manually that agents could handle better?



