Teamwork makes the dream work for the myriad specialized AI agents that may soon be joining offices everywhere.
One of the key questions driving Ece Kamar’s research as managing director of Microsoft’s AI Frontiers Lab is how to coordinate networks of these agents—AI systems that can perform autonomous tasks beyond the scope of chatbots. Late last year, her lab developed AutoGen, a popular open-source tool for creating multi-agent networks, and Microsoft turned it into a low-code studio for businesses earlier this year.
But it’ll likely take a lot more than that for businesses to be comfortable handing over swaths of their operations to fully autonomous systems. Indeed, Kamar, whose 2010 doctorate focused on human-AI collaboration as well as multi-agent systems, said human oversight and accountability will be key.
With the concept of agents ubiquitous on 2025 AI prediction lists, we spoke with Kamar about the future of this tech, the safeguards needed, and the year ahead in AI research.
This interview has been lightly edited for length and clarity.
What needs to happen for adoption of multi-agent systems to become widespread, and what do you see as the time frame for that?
We released AutoGen a little more than a year ago, and it quickly became one of the most popular libraries in the agentic space, because the fact that you could use multiple agents to carry out a complex task turned out to be a really powerful programming paradigm for building agentic systems. What I mean by that is, imagine you have a difficult problem to solve that an LLM cannot do alone. If I can have an agent—we call it an orchestrator—that's the reasoner agent, the problem-solver agent that can take that [problem] and divide it into steps, then identify which existing agent has the right capabilities to do each part of the task, delegate it, and bring the pieces together, that turns out to be a really useful way of doing more with these agents. So that has been a fundamental part of many organizations building agents right now, and that's why AutoGen has been so popular, both in industry and in academic circles.
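For readers who want a concrete picture of the orchestrator pattern Kamar describes, here is a minimal sketch using AutoGen's 0.2-style group chat API. The agent roles, task, and config values are illustrative, not taken from the interview; the manager agent plays the orchestrator, deciding which specialist handles each step.

```python
import autogen  # pip install pyautogen (AutoGen 0.2-style API)

# Illustrative config; in practice this points at whatever model endpoint you use.
llm_config = {"config_list": [{"model": "gpt-4", "api_key": "YOUR_KEY"}]}

# Specialist agents, each representing a distinct capability.
researcher = autogen.AssistantAgent(
    name="researcher",
    system_message="You gather facts and sources needed for the task.",
    llm_config=llm_config,
)
writer = autogen.AssistantAgent(
    name="writer",
    system_message="You turn the researcher's notes into a final answer.",
    llm_config=llm_config,
)
# A user proxy that relays the human's request into the group chat.
user = autogen.UserProxyAgent(
    name="user",
    human_input_mode="NEVER",
    code_execution_config=False,
)

# The GroupChatManager acts as the orchestrator: it picks which agent
# speaks next, effectively decomposing the task across the specialists.
group = autogen.GroupChat(agents=[user, researcher, writer], messages=[], max_round=8)
orchestrator = autogen.GroupChatManager(groupchat=group, llm_config=llm_config)

user.initiate_chat(orchestrator, message="Summarize recent work on multi-agent orchestration.")
```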
And what we are seeing right now is that some of those reasoning capabilities that we do with multi-agent orchestration now can be done by models like [OpenAI] o1 by default, because they have these inference-time problem-solving abilities, which is really great to see, because this pushes the capabilities of these systems forward across the whole stack. But I personally don't think multi-agent approaches and these agents are going to go away, because what we are seeing with our work with AutoGen, working with many partners in the open-source community, is that agents are now becoming persistent entities. Many people, many organizations are creating agents to represent different capabilities and expertise in their organization. And then this need for orchestrating work across these multiple experts is becoming a thing. Each of those experts is powered by some strong model on the backend, but they carry out these roles, and I think that will be a really interesting thing to shape up in the next few years as well. Like, are we going to have new ecosystems, marketplaces of these agents emerging in the world? I think that's something we are getting to right now.
Getting to your question about what some of the blockers are: one of them is performance and capabilities and reliability. So in my organization, we are using agentic benchmarks to measure where we are as a community in terms of powering these agents to carry out many tasks that humans would like to do with them. And we are seeing that even the best systems we are building right now can do, let's say, 30% to 35% of the tasks we throw at them in reliable ways, whereas this number is around 85% for a person.
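The reliability figures Kamar cites come from running agents against suites of benchmark tasks. As a rough illustration only (not the lab's actual harness), the headline number is simply the fraction of tasks an agent completes to the benchmark's satisfaction:

```python
from typing import Callable, Iterable

def pass_rate(agent: Callable[[str], str],
              tasks: Iterable[tuple[str, Callable[[str], bool]]]) -> float:
    """Fraction of benchmark tasks the agent completes successfully.

    Each task is a (prompt, checker) pair; the checker decides whether
    the agent's output counts as a reliable completion.
    """
    tasks = list(tasks)
    passed = sum(1 for prompt, check in tasks if check(agent(prompt)))
    return passed / len(tasks)

# Today's best agentic systems land around 0.30-0.35 on such suites,
# versus roughly 0.85 for a person, per the interview.
```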
So there is still this capability gap that we need to bridge to make these agents a lot more reliable in terms of doing many different tasks. And I expect in the next year, there are going to be innovations, both in models' memory and learning, that will start closing this gap.
The second limitation is safety. If I’m telling this agent to go buy tickets for me, book travel for me, I really need to make sure that when that agent buys a ticket for me, the agent is buying the right ticket. So there’s still quite a bit we need to figure out to make sure that when these agents are taking these actions on our behalf, they are doing it safely, they are doing it the way we want them to, and then—especially when we get into these ecosystem and marketplace questions that we were just talking about—there’s a question about authenticity, validation, how agents share information across trust and information boundaries and such. So there are just all of these questions that are now coming up that are super interesting for the computer science community around trust, around distributed work, around just creating ecosystems that we have good control over. And all of those pieces will need to come together for these agents to become a reality in the world, rather than a research subject.
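One common way to get the kind of safety Kamar describes is to validate an agent's proposed action against explicit constraints before anything is executed. The sketch below is hypothetical; the policy fields and the ticket example are made up for illustration and are not a description of Microsoft's systems.

```python
from dataclasses import dataclass

@dataclass
class TicketAction:
    destination: str
    departure_date: str  # ISO format, e.g. "2025-03-14"
    price_usd: float

@dataclass
class TravelPolicy:
    allowed_destination: str
    latest_departure: str
    max_price_usd: float

def validate(action: TicketAction, policy: TravelPolicy) -> list[str]:
    """Return a list of policy violations; an empty list means the action may proceed."""
    problems = []
    if action.destination != policy.allowed_destination:
        problems.append(f"wrong destination: {action.destination}")
    if action.departure_date > policy.latest_departure:  # ISO dates compare lexically
        problems.append(f"departs too late: {action.departure_date}")
    if action.price_usd > policy.max_price_usd:
        problems.append(f"over budget: ${action.price_usd:.2f}")
    return problems

# The agent only calls the booking API if validate(...) comes back empty;
# otherwise the proposed action is escalated to the person who issued the task.
```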
In terms of safety, it seems like a lot of companies are still hesitant to trust even chatbots with certain things—there’s a lot of concern around hallucinations. What do you think it could take to get businesses to the point where they feel like they can trust an agent to act on their behalf?
I believe we will need to invest in a number of key technology pieces and bring them together to reach the trust levels that you're talking about. One of them is human accountability and human oversight. For all AI systems we build, unless these systems reach an almost perfect performance level, we have to have humans in charge who oversee these systems, really understand how they work, and are able to correct them as needed. In fact, one of the things we are doing in my organization is also investing in this interface layer between the agents and people, where people understand what these agents are doing. They have ways of validating, correcting, and accepting the actions of the agents, and that is a core requirement for the work we do. Unless these agents get to that maturity level you're talking about, that human control is paramount, and it cannot be overridden by AI agents.
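A minimal sketch of the kind of interface layer she describes: every proposed agent action is surfaced to a person, who can accept, correct, or reject it before anything runs. The function names here are hypothetical; AutoGen exposes a related control by setting human_input_mode="ALWAYS" on its user proxy agent.

```python
def human_review(action: str) -> str | None:
    """Show a proposed agent action to the operator and return the approved
    version, a corrected version, or None if the action is rejected."""
    print(f"Agent proposes: {action}")
    choice = input("[a]ccept / [c]orrect / [r]eject: ").strip().lower()
    if choice == "a":
        return action
    if choice == "c":
        return input("Enter corrected action: ")
    return None  # rejected: nothing is executed

def run_with_oversight(proposed_actions, execute):
    """Gate every agent action behind explicit human approval."""
    for action in proposed_actions:
        approved = human_review(action)
        if approved is not None:
            execute(approved)
```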
The second piece that needs to come together is that these agents should have a memory, and they should learn. Let's say I've already done this task with these agents. The next time the same task comes up, the agents should use the feedback I have given them in the past and do it the way I expect them to, because that's how we develop trust. We develop trust by observing that the systems we are working with are working the way we expect them to, and that can only happen through these systems having a memory and a learning loop with the users. So those are some of the technology pieces. People are so focused on just the models and how good the models are getting, but in my opinion, for these systems to become a reality in the world, to become trusted realities in the world, the combination of these approaches, including how we interact with them, needs to be developed as well.
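As a sketch of the memory-and-feedback loop she describes, one deliberately simple approach is to persist user corrections keyed by task and fold them back into the agent's instructions the next time the same task appears. All file names and helpers below are illustrative assumptions, not a real AutoGen feature.

```python
import json
from pathlib import Path

MEMORY_FILE = Path("agent_feedback.json")  # illustrative storage location

def load_memory() -> dict[str, list[str]]:
    """Load all feedback recorded so far, keyed by task description."""
    return json.loads(MEMORY_FILE.read_text()) if MEMORY_FILE.exists() else {}

def remember_feedback(task: str, feedback: str) -> None:
    """Persist a user's correction so the agent can reuse it next time."""
    memory = load_memory()
    memory.setdefault(task, []).append(feedback)
    MEMORY_FILE.write_text(json.dumps(memory, indent=2))

def build_prompt(task: str) -> str:
    """Fold prior feedback for this task into the agent's instructions."""
    notes = load_memory().get(task, [])
    preferences = "\n".join(f"- {note}" for note in notes)
    return f"Task: {task}\nUser preferences from past runs:\n{preferences or '- none yet'}"
```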
There’s been a lot of talk recently about whether foundation models are hitting a wall, at least in terms of pre-training and the amount of data needed to scale them. Do you expect the progress to slow in terms of these big foundation models in the coming year?
In my organization, we see foundation models as only one component of the larger system, and it's the performance of that whole system that matters. And of course, any advances we have on the foundation model side are very useful. They push the whole stack up.
But it's not the only thing we invest in. So for example, when we came up with the idea of AutoGen, the multi-agent orchestration, one of the things we observed is that using multiple agents in a decentralized way can overcome some of the limitations of the transformer architecture. So these ideas can be complementary. And this is also happening now in the modeling space, at places like OpenAI and Anthropic. They are going for scale, of course, and they are going to be releasing larger and larger models, but they are also putting effort into inference-time reasoning, because everybody is seeing that there are going to be these complementary approaches that will be fueling innovation forward. So I personally don't expect innovation in the space to slow down. I don't have any data on whether scaling is going to bring diminishing returns or not, but I think there are so many exciting directions that will be added on top of the scale that, as a field, we are not going to be slowing down in terms of capabilities.
Do you expect to see a lot more specialized LLMs in the next year?
We are going to be seeing a lot more specialization in the models, mainly because, especially when it comes to agents, not every agent needs to have the exact same capabilities. If I'm building an agent to just interface with the web, that agent needs very different capabilities than an agent that's assisting doctors with medical questions. And if I'm trying to power these agents with exactly the same models, those models need to know everything, and they need to be very large, whereas our work has shown that if I know exactly what capabilities I'm looking for in that agent, I can specialize models for those capabilities and do that very efficiently with small models.
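A sketch of what that specialization can look like in practice: route each agent role to the smallest model that covers its capability, falling back to a generalist only when no specialist fits. The model names and role labels below are placeholders, not real model identifiers or recommendations.

```python
from typing import Callable

# Map each agent role to the smallest model that covers its capability.
MODEL_FOR_ROLE = {
    "web_browsing": "small-web-navigator",
    "medical_qa": "small-medical-specialist",
    "general_reasoning": "large-generalist",  # fallback for everything else
}

def pick_model(role: str) -> str:
    """Return the specialized model for a role, falling back to the generalist."""
    return MODEL_FOR_ROLE.get(role, MODEL_FOR_ROLE["general_reasoning"])

def run_agent(role: str, prompt: str, call_model: Callable[[str, str], str]) -> str:
    """Run one agent turn, letting the caller supply whatever inference client they use."""
    return call_model(pick_model(role), prompt)
```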