The past few years have been marked by an obsession with language models. We’ve talked about them a great deal here on Software and Synapses, discussing their effects on the industry and the mental models we can use to understand them. Throughout these essays, a common theme has emerged: language models’ core accomplishment is the automation of approximate reasoning.
Our pre-existing symbolic systems could perform complex functions, but ultimately only did what we built them to do. The scope of a system was inherently limited to its predesigned purpose. That works out pretty well, all things considered. You build bridges out of steel beams, after all. Still, a successful tool needs flexibility as well as rigidity.
That’s where language models have proved themselves useful, as they allow the flexibility of human language to be baked into an automated system. A language model is trained on massive corpora of human text; give it a prompt, and it spits out the most likely next sequence of tokens based on learned statistical relationships. In other words, you can get an approximately human response to a given prompt, from a computer, in a few seconds.
Naturally, this new type of system, which some have called a “reasoning engine”, has given rise to new application paradigms. One such paradigm is the “agent”. That’s not to say that agents are new: robots have been around since at least the 1960s, and research into artificial intelligence saw massive growth in the 1980s. However, with a more abstract, approximately-human type of reasoning accessible via API, agent designs have become far more imaginative.
In this series, I want to get hands-on with autonomous agents. I even made one. A while back, I wrote an article on the Anatomy of an Autonomous Agent (please remind me to re-do the voiceover on that). The point of that post was to conceptualize an abstract framework for autonomous agents, at least to some degree, and get an idea of how they work. However, it wasn’t until a couple of weeks ago that I finally took that framework and implemented it in code. I’m calling it Simple Agent, or Simmy, and here’s a little demo.
The goal of the Simple Agent project is multi-faceted. First, I wanted to actually build the conceptual framework that I laid out before. Second, I want to use it as a template for more specialized agents. Third, I want to use it to test different ideas and experiment. If I want to try a new approach to something, perhaps a new prompt, I can just update it and see if it improves the system. However, this use-case introduces a crucial question: how do you know if the system has improved?
One improvement to this publication would be a subscription from you. If you haven’t already, consider dropping your email below :)
The Problem
With traditional applications, there is a certain range of expected behavior. If an application is slow, you can make updates, and if the application’s speed improves, you’ll know that the updates worked. If there’s a bug, you push a change, and if the bug is gone, then you can consider the mission successful. This is how you iteratively improve traditional products. For our mechanistic applications, testing is as simple as making sure everything’s in the right place, and that data flows in the expected way.
Additionally, when it comes more specifically to performance, a traditional application may be measured on various criteria, whether that be time taken to complete a task, computational resources required, or other measures of efficiency. These criteria are mechanistic and easy to encapsulate with tests. Performance can be measured over time, and comparisons between versions can be made.
For agents, on the other hand, performance is a bit more abstract. Expected behavior itself is difficult to solidify. Of course, an agent’s underlying codebase still benefits greatly from unit tests and integration tests, since at its heart it’s a deterministic program. Simple performance metrics can also be useful. However, the core function of the agent is significantly more complex. A bug is usually classified as unexpected or unwanted behavior by the program, but what constitutes a bug for an agent? What makes a behavior unexpected or unwanted?
When an agent is instructed to complete a task, the tree of possible actions it might take grows very quickly. Take the above example, where we have a single prompt followed by two similar but ultimately different approaches to the same problem. This practical non-determinism makes it trickier to measure whether behavior is correct or not. Simply creating a test for every single task that an agent might be instructed to perform is impractical. Even if it weren’t, the contents of such a test aren’t apparent either.
Once we get into such landscapes of complexity, we shift away from tests of the placement and flow of data, and towards benchmarks. A benchmark is a more substantive test, one which measures a given subject’s performance on various meaningful criteria. Benchmarks have become the norm for objective evaluation of language models, and increasingly of agents. So, our goal should be to develop and use an objective benchmarking system to measure improvements to our agents.
This way, building out an agent becomes a task which can be done iteratively, with performance tracked over time. Updates either improve or degrade the agent’s performance on the benchmark. The benchmark acts as the bottom line, measuring the actual utility of the agent in its current state. In a way, a benchmark is an objective measure of the gap between the agent and the project requirements. It guides the project towards completion.
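To make that a bit more concrete, here’s a minimal sketch (in Python) of what tracking a benchmark as the bottom line between versions could look like. The criteria names and scores are hypothetical placeholders, not Simple Agent’s actual results.

```python
# A minimal sketch of version-over-version benchmark tracking.
# The criteria and scores below are hypothetical placeholders.
from dataclasses import dataclass


@dataclass
class BenchmarkResult:
    version: str
    scores: dict[str, float]  # criterion name -> score in [0, 1]


def compare(baseline: BenchmarkResult, candidate: BenchmarkResult) -> dict[str, float]:
    """Per-criterion delta between two agent versions."""
    return {
        criterion: candidate.scores.get(criterion, 0.0) - score
        for criterion, score in baseline.scores.items()
    }


v1 = BenchmarkResult("0.1.0", {"tool_selection": 0.72, "task_completion": 0.60})
v2 = BenchmarkResult("0.2.0", {"tool_selection": 0.81, "task_completion": 0.58})
for criterion, delta in compare(v1, v2).items():
    print(f"{criterion}: {delta:+.2f}")
```

An update that pushes those numbers downward is a regression, regardless of how clever it felt to write.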
However, saying an agent should be benchmarked is like saying an application should be tested. First, it probably won’t actually happen as often as it should. Second, the benchmark must be specifically designed for the agent. Frameworks exist for testing or profiling, but they have to be specifically tuned to an individual application through adapters. Even then, they may not be comprehensive, valuing general applicability over specificity.
The devil is in the details. However, we’ve identified our problem and our goal. Before we can really get started building an agent, we need to figure out how to benchmark it.
Some Precedent
I am not the first person to think about this, by a long shot. Smarter and more experienced people than I have tried their hand at this issue, and achieved great success. So, let’s take a look at some precedent.
AutoGPT
Early in the agent game came AutoGPT. Released in March 2023, at the start of the AI boom, AutoGPT solidified itself as a capable system for agent development. The project aims to be a platform for the creation and development of intelligent agents, built and fine-tuned to your specific use-case. In service of this goal, the AutoGPT team built an early benchmark for comparing agents.
The benchmark, available here, tests the performance of agents on various chosen criteria, including, but not limited to:
Interface: How accessible is interfacing with the agent?
Code Generation: How well does it generate new code?
Code Modification: How well does it modify existing code?
Memory: How accurately and usefully does the agent store memory?
Information Retrieval: How well can the agent retrieve data from a source?
Planning: How well does the agent make plans?
Safety: How well does the agent stick to guard rails?
From these, we can identify a couple of patterns.
First, the meaningful criteria that the AutoGPT benchmark includes are heavily geared towards a task-based, coding agent. While I am skeptical of purely task-based agents, and I think they should be a bit more broad in their design, this is still excellent in terms of inspiration. Simple Agent is very task-based, which is helpful for creating continuity between actions taken by the agent.
For our specific use-case, since Simple Agent is designed to be general-purpose, it’s hard to say that code generation and modification are directly applicable. These criteria are better applied to other projects, such as Universal Constructor.
Measuring the interface, on the other hand, is an especially critical inclusion. Again, I’m skeptical of the so-called “Agent Protocol”, and other purely-task-based systems, but I do think that interfacing with the agent is important. If we’re going to be comparing improvements over time, then the communication surface of the agent should be tracked as well.
Memory and information retrieval are also broadly applicable, and useful for Simple Agent. We’ll have to discuss ideas regarding agent memory in future installments, but for today’s purposes, it’s enough to say that memory is crucial for building an agent’s perception.
Memory is the collection of meaningful data points which might be useful to reference in the future. An agent’s ability to create and draw from this store is, I think, going to be one of the greater influences on its performance. We’ll want to track this somehow.
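As a rough illustration of the kind of tracking I mean, here’s a sketch that scores a retriever against queries with a known relevant memory. The word-overlap retriever and the snippets are naive stand-ins, not how Simple Agent actually handles memory.

```python
# A sketch of scoring memory retrieval: given stored snippets and queries
# with a known relevant snippet, how often does the retriever surface it?
# The word-overlap retriever is a deliberately naive stand-in.

def retrieve(query: str, memories: list[str]) -> str:
    """Return the stored snippet sharing the most words with the query."""
    query_words = set(query.lower().split())
    return max(memories, key=lambda m: len(query_words & set(m.lower().split())))


def recall_at_1(cases: list[tuple[str, str]], memories: list[str]) -> float:
    """Fraction of queries where the retrieved snippet is the relevant one."""
    hits = sum(1 for query, relevant in cases if retrieve(query, memories) == relevant)
    return hits / len(cases)


memories = [
    "The user prefers short, bulleted summaries.",
    "The staging database lives at db-staging.internal.",
    "Deploys happen every Friday afternoon.",
]
cases = [
    ("Where is the staging database?", memories[1]),
    ("How does the user like summaries formatted?", memories[0]),
]
print(f"Recall@1: {recall_at_1(cases, memories):.2f}")
```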
Machiavelli
Machiavelli is a fascinating proposed benchmark which uses choose-your-own-adventure games to test the ethics of an agent. The idea is sound, and the concept is likely applicable to other test sets. Essentially, by using the choose-your-own-adventure style as a template, one can track the decision-making, and therefore reasoning, abilities of an agent in a more controlled environment.
“A mock-up of a game in the MACHIAVELLI benchmark, a suite of text-based reinforcement learning environments. Each environment is a text-based story. At each step, the agent observes the scene and a list of possible actions; it selects an action from the list. The agent receives rewards for completing achievements. Using dense annotations of our environment, we construct a behavioral report of the agent and measure the trade-off between rewards and ethical behavior.”
The idea of the Machiavelli benchmark provides an excellent template for evaluation of decision-making and task flow. In this specific benchmark, ethics are tested. That’s really important… but, not too important for our specific use-case. We want to test the performance of the agent on other criteria. However, we should absolutely keep in mind the structure of the Machiavelli project as a way to potentially conduct those tests.
In the aforementioned Anatomy of an Autonomous Agent post, one key aspect was the idea of an agent’s perception, or input to the reasoning engine. Essentially, an agent takes in environmental stimuli, along with memory and a goal, and uses that to determine its next action. With that being the case, the step-by-step evaluation of decision-making proposed by the Machiavelli benchmark would allow us to test various criteria per decision.
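As a sketch of what that per-decision evaluation could look like (simplified far beyond the actual Machiavelli harness, with hypothetical scenes and a stand-in agent):

```python
# A sketch of Machiavelli-style per-decision evaluation: each scene is
# annotated with a preferred action, and the agent is scored per choice.
# `choose` stands in for whatever the agent's reasoning step is.
from dataclasses import dataclass
from typing import Callable


@dataclass
class Scene:
    observation: str
    actions: list[str]
    preferred: int  # index of the annotated "best" action


def per_decision_accuracy(choose: Callable[[str, list[str]], int], scenes: list[Scene]) -> float:
    """Fraction of scenes where the agent picks the annotated action."""
    correct = sum(1 for s in scenes if choose(s.observation, s.actions) == s.preferred)
    return correct / len(scenes)


# A trivial stand-in agent that always picks the first option.
naive_agent = lambda observation, actions: 0

scenes = [
    Scene("A locked door blocks the hallway.", ["Force the lock", "Ask the guard for the key"], 1),
    Scene("The log file is 2 GB.", ["Read it all into memory", "Stream it line by line"], 1),
]
print(f"Per-decision accuracy: {per_decision_accuracy(naive_agent, scenes):.2f}")
```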
Of course, in the real world, agents likely won’t always be confined to a series of pre-selected decisions. In order to address this problem, we have the next inspiration.
𝜏-bench
𝜏-bench is a benchmark which aims to simulate real-life situations for agents. It is built around the idea that agents are meant to interact with real users and real tools. Now, the benchmark simulates these tools and users rather than having real people interact with agents all the time. But the LLM simulations allow agents to be evaluated in real-world-style situations, following domain-specific policies and tooling.
“Example of an airline reservation agent in 𝜏-bench. If a user wants to change their flight reservation to a different destination airport, the agent needs to gather all the required information by interacting with the user, check the airline policies using the guidelines provided, find new flights and (if possible) rebook for the user using complex airline reservation APIs.”
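As a very rough sketch of that simulated-user setup, here’s the shape of the interaction loop. The scripted user, the agent stub, and the success check are all placeholders; this isn’t 𝜏-bench’s actual harness.

```python
# A sketch of a simulated-user evaluation loop, in the spirit of tau-bench:
# a simulated user and an agent exchange messages, then the final environment
# state is compared against the expected outcome. Everything here is a stub.

def simulated_user(turn: int) -> str:
    """Stand-in for an LLM simulating the user; scripted for this sketch."""
    script = ["I need to change my flight to Denver.", "Yes, the 3pm one works.", "done"]
    return script[min(turn, len(script) - 1)]


def agent_respond(message: str, state: dict) -> str:
    """Stand-in for the agent; a real one would call tools and follow policy."""
    if "Denver" in message:
        state["destination"] = "DEN"  # pretend the agent rebooked via an API
    return "Okay, I've updated your reservation."


def run_episode(expected_state: dict, max_turns: int = 5) -> bool:
    state = {"destination": "SFO"}
    for turn in range(max_turns):
        message = simulated_user(turn)
        if message == "done":
            break
        agent_respond(message, state)
    return state == expected_state


print("Episode success:", run_episode({"destination": "DEN"}))
```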
I really like the straightforward and practical style of this benchmark. If agents are going to be out there acting in the wild, why not try to replicate that? Of course, I’m not sure that this benchmark addresses the problem of granular decision-making at each step. I’d need to poke around a bit further to know for sure. However, goal-based evaluation doesn’t address the specific steps an agent takes while reasoning its way towards completing a task.
With that said, I wonder how much that actually matters. If an agent is built to complete tasks, I suppose as long as it does that, it’s not a huge deal. On the other hand, efficiency is important. In the classic saying “make it work, make it right, make it fast”, there is a “make it fast” part. So, measuring efficiency in task-completion is still crucial. I could potentially see a combination of these various approaches to benchmarking being ideal.
Dimensions of Meaning
So what meaningful criteria ought we to test in the first place? There’s certainly some inspiration above, but we should nail down exactly what we hope agents will do. What do we care about here? More specifically, let’s go through some criteria upon which versions of Simple Agent will be tested, in order to prove its effectiveness over time.
Tools
First off, an agent is not much more than a chatbot without its tools. Tools are used to interact with and affect the environment. As such, testing an agent’s ability to select and use tools is critical. Given a task and a current state of the environment, an agent must decide which action to take, then take that action. With that said, we can introduce two meaningful criteria to our imaginary benchmark:
Tool Selection: How accurately does the agent choose the right tool for the job?
Tool Use: After selection, how accurately does the agent actually use the tool?
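A minimal sketch of how those two criteria might be scored separately follows. The test cases, tool names, and the run_agent_step stub are hypothetical, not part of Simple Agent today.

```python
# A sketch of scoring tool selection and tool use as separate criteria.
# The cases, tool names, and the agent stub are hypothetical.
from dataclasses import dataclass


@dataclass
class ToolCase:
    task: str
    expected_tool: str
    expected_args: dict


def run_agent_step(task: str) -> tuple[str, dict]:
    """Stand-in for one agent decision: returns (tool_name, arguments)."""
    return "grep", {"pattern": "TODO", "path": "."}  # this stub always greps


def score(cases: list[ToolCase]) -> tuple[float, float]:
    selected = used = 0
    for case in cases:
        tool, args = run_agent_step(case.task)
        if tool == case.expected_tool:
            selected += 1
            if args == case.expected_args:
                used += 1  # only count correct use when the right tool was chosen
    return selected / len(cases), used / len(cases)


cases = [
    ToolCase("Find TODO comments in the repo", "grep", {"pattern": "TODO", "path": "."}),
    ToolCase("Fetch the latest release notes", "http_get", {"url": "https://example.com/notes"}),
]
selection_acc, use_acc = score(cases)
print(f"Tool selection: {selection_acc:.2f}, tool use: {use_acc:.2f}")
```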
Reasoning
Next up is the whole point of the agent itself, and likely its most significant bottleneck. An agent must reason. Reasoning can essentially be understood as the logical steps an agent takes in deciding on an action. So, an agent reasons its way to a decision, and then acts upon it. With that said, reasoning extends beyond tool-use specifically, towards plan formation and understanding. Additionally, reasoning is what allows for efficiency, since each step towards task completion compounds on the ones before it.
Here are some valuable metrics for us to consider.
Task Completion: Is the task actually completed?
Plan Formation: How well does the agent form specific plans?
Steps Required: How many steps were required to complete the task?
Task Understanding: How well did the agent understand the task in the first place?
Redundancy: How often were steps repeated when they didn’t need to be?
Assumptions: Does the agent make assumptions about tasks that could hurt accuracy?
Assumption Rate: How often are assumptions made?
Assumption Accuracy: How often are assumptions correct?
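Here’s a sketch of how several of these could be aggregated from per-task traces, assuming the harness records the steps taken and the assumptions the agent voiced. The trace fields are my own invention, not something Simple Agent logs yet.

```python
# A sketch of aggregating reasoning metrics from per-task traces.
# The trace fields are assumptions about what a harness might record.
from dataclasses import dataclass, field


@dataclass
class TaskTrace:
    completed: bool
    steps: list[str]  # actions taken, in order
    assumptions: list[tuple[str, bool]] = field(default_factory=list)  # (assumption, was_correct)

    @property
    def redundancy(self) -> float:
        """Fraction of steps that exactly repeat an earlier step."""
        if not self.steps:
            return 0.0
        return 1 - len(set(self.steps)) / len(self.steps)


def summarize(traces: list[TaskTrace]) -> dict[str, float]:
    all_assumptions = [a for t in traces for a in t.assumptions]
    return {
        "task_completion": sum(t.completed for t in traces) / len(traces),
        "avg_steps": sum(len(t.steps) for t in traces) / len(traces),
        "avg_redundancy": sum(t.redundancy for t in traces) / len(traces),
        "assumption_rate": len(all_assumptions) / len(traces),
        "assumption_accuracy": (
            sum(ok for _, ok in all_assumptions) / len(all_assumptions)
            if all_assumptions else 1.0
        ),
    }


traces = [
    TaskTrace(True, ["read_file", "edit_file", "run_tests"], [("tests live in ./tests", True)]),
    TaskTrace(False, ["search", "search", "read_file"], [("an API key is configured", False)]),
]
print(summarize(traces))
```

Plan formation and task understanding are harder to score mechanically; those will likely need rubric-style grading, possibly by another model.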
Of course, reasoning is a broad term, and much more can be evaluated here. However, there is a significant amount of room for growth, and we have to start somewhere.
Architectural Integrity
The architecture of the agent itself is important. Agents like Simple Agent are designed to be modular, so components are often hot-swapped. A new embeddings model might be chosen for memory. One might switch the language model to the latest one. However, the underlying architecture of the agent matters too, and one way to test its robustness is to run the agent in various configurations and measure consistency.
So, here are some robustness metrics:
Inter-LLM Ability: How consistently does the agent perform when the underlying reasoning engine is changed?
Inter-Embeddings Ability: How consistently does the agent perform when the underlying embeddings model is changed?
Inter-Toolset Ability: How consistently does the agent perform when the specific tools it has access to are changed?
These, and other metrics like them, would allow the performance of the agent’s architecture itself to be evaluated. For something like Simple Agent, this is important: if it’s meant to act as a template for others, then they’ll care about its quality even after they swap components out.
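One simple way to quantify that consistency is to run the same suite under several configurations and look at the spread of scores. A sketch, with hypothetical model names and scores:

```python
# A sketch of measuring architectural consistency: run the same benchmark
# suite under several configurations and report the spread per axis.
# The configuration names and scores are hypothetical placeholders.
from statistics import mean, pstdev


def consistency(scores_by_config: dict[str, float]) -> dict[str, float]:
    """Lower spread relative to the mean suggests the architecture is less
    sensitive to swapping out that component."""
    values = list(scores_by_config.values())
    avg = mean(values)
    return {"mean": avg, "stdev": pstdev(values), "relative_spread": pstdev(values) / avg}


# Hypothetical overall benchmark scores with different reasoning engines plugged in.
inter_llm_scores = {"model_a": 0.74, "model_b": 0.69, "model_c": 0.71}
print("Inter-LLM consistency:", consistency(inter_llm_scores))
```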
Code
Next up: I know I said earlier that code wasn’t super important to the Simple Agent project, since it’s designed to be generally applicable. However, as a developer, a lot of my use-cases end up being code-related, so it’s worth talking about some relevant evaluation metrics.
Code Quality: A composite of various evaluations of outputted code.
Code Optimization: How well can the agent optimize existing code?
Testing: How well can the agent test its code?
Test Accuracy: How often do the tests fail or pass when they shouldn’t?
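Test accuracy in particular lends itself to a concrete check: a generated test should pass against a known-good implementation and fail against a deliberately buggy one. A sketch, with hypothetical implementations and a stand-in for an agent-written test:

```python
# A sketch of checking "test accuracy": agent-written tests should pass
# against a known-good implementation and fail against a known-buggy one.
# The implementations and the generated test below are hypothetical.
from typing import Callable


def good_add(a: int, b: int) -> int:
    return a + b


def buggy_add(a: int, b: int) -> int:
    return a - b  # deliberately wrong


def generated_test(add: Callable[[int, int], int]) -> bool:
    """Stand-in for a test the agent wrote; returns True if it passes."""
    return add(2, 3) == 5 and add(-1, 1) == 0


def test_accuracy(tests: list[Callable], good: Callable, buggy: Callable) -> float:
    """A test counts as accurate if it passes on good code and fails on buggy code."""
    accurate = sum(1 for t in tests if t(good) and not t(buggy))
    return accurate / len(tests)


print(f"Test accuracy: {test_accuracy([generated_test], good_add, buggy_add):.2f}")
```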
These evaluation metrics, while not entirely comprehensive, will hopefully act as a good starting point towards our next steps of agent evaluation and benchmarking.
Next Steps
Now that we’ve got a pretty good theoretical framework of how to deal with non-determinism in performance profiling, the next step is to put that idea to work. More specifically, in the next installment of this series, we’re going to try our hand at actually nailing down specific tests that we can run against the agent, testing them manually, and then seeing about potentially automating the process.
Author’s Note
This was actually really fun to write! I’ll be honest, I’ve gotten a bit of writer’s block lately and I wasn’t sure what to write about. However, one of my biggest areas of interest is autonomous agents, and so I’ve been working on those a bit lately on the side. Then I realized I can talk about them here! So, this series, All About Agents, is going to be an ongoing collection of deep dives specifically regarding agents.
Later on, I’ll probably do hands-on tutorials on my YouTube channel going over how I’m actually building Simple Agent and such. However, for now, I think it’s best to get down a really solid theoretical basis for later development. Having objective evaluation criteria and processes is a necessary step in that direction. It’s also super interesting!
Let me know your thoughts in the replies below, or over on Substack Notes! What do you think we should focus on when evaluating agentic systems? Anyway, as always, thank you so much for reading, and I’ll see you next time! Goodbye.