The Agent Company Benchmark: Evaluating AI's Real-World Capabilities

32:45

The dizzying pace of AI development has sparked fierce debate about automation in the workplace, with headlines swinging between radical transformation and cautious scepticism. What's been missing is concrete, objective evidence about what AI agents can actually do in real professional settings.

TLDR:

  • The Agent Company benchmark creates a simulated software company environment to test AI agents on realistic work tasks
  • AI agents must navigate digital tools including GitLab, OwnCloud, Plane, and RocketChat while interacting with simulated colleagues
  • Even the best-performing model (Claude 3.5 Sonnet) achieved only a 24% full-completion rate across all tasks
  • Surprisingly, AI agents performed better on software engineering tasks than on seemingly simpler administrative or financial tasks
  • Common failure modes include lack of common sense, poor social intelligence, and inability to navigate complex web interfaces
  • Performance differences likely reflect biases in available training data, with coding having much more public data than administrative tasks
  • The gap between open source and closed source models appears to be narrowing, suggesting wider future access to capable AI systems

The Agent Company benchmark offers that much-needed reality check. By creating a fully simulated software company environment with all the digital tools professionals use daily—code repositories, file sharing, project management systems, and communication platforms—researchers can now rigorously evaluate how AI agents perform on authentic workplace tasks.
The results are eye-opening.

Even the best model achieved just 24% full completion across the benchmark's 175 tasks, with particularly poor performance on social interaction and on navigating complex software interfaces.
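
To make a figure like "24% full completion" concrete, here is a minimal sketch of checkpoint-style scoring in Python. It assumes each task is graded against a set of checkpoints: a task counts as fully completed only when every checkpoint passes, and otherwise earns partial credit in proportion to the checkpoints it did pass. The function names and the simple proportional weighting are illustrative assumptions for this summary, not the benchmark's published evaluation code.

  # Illustrative sketch only, not the benchmark's actual evaluation code.
  def score_task(checkpoints_passed: int, checkpoints_total: int) -> dict:
      # A task is "fully completed" only if every checkpoint passes;
      # otherwise it earns partial credit for the checkpoints it did pass.
      if checkpoints_total <= 0 or not (0 <= checkpoints_passed <= checkpoints_total):
          raise ValueError("invalid checkpoint counts")
      return {
          "full_completion": checkpoints_passed == checkpoints_total,
          "partial_score": checkpoints_passed / checkpoints_total,
      }

  def full_completion_rate(results: list[dict]) -> float:
      # The headline number is this kind of rate: the share of tasks
      # (here, out of 175) that an agent finished end to end.
      return sum(r["full_completion"] for r in results) / len(results)

One consequence of a metric like this is that an agent can pick up partial credit on many tasks while still posting a low full-completion rate, which is why a single headline percentage only tells part of the story.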

Counterintuitively, AI agents performed better on software engineering tasks than on seemingly simpler administrative or financial work—likely reflecting biases in available training data.
The specific failure patterns reveal fundamental limitations: agents struggle with basic common sense (not recognizing that a Word document requires word-processing software), with social intelligence (missing obvious implied actions after conversations), and with web navigation (getting stuck on routine pop-ups that humans dismiss without thinking).

In one telling example, an agent unable to find a specific colleague attempted to rename another user to the desired name—a bizarre workaround that highlights flawed problem-solving approaches.
For anyone trying to separate AI hype from reality, this benchmark provides crucial context. While the results show meaningful progress in automating certain professional tasks, they confirm that human adaptability, contextual understanding, and social navigation remain essential in the modern workplace.
Curious to explore the benchmark yourself? Visit theagentcompany.com or find the code repository on GitHub to join this important conversation about the future of work.

Research: TheAgentCompany: Benchmarking LLM Agents on Consequential Real-World Tasks

Support the show

𝗖𝗼𝗻𝘁𝗮𝗰𝘁 my team and me to get business results, not excuses.
☎️ https://calendly.com/kierangilmurray/results-not-excuses
✉️ [email protected]
🌍 www.KieranGilmurray.com
📘 Kieran Gilmurray | LinkedIn
🦉 X / Twitter: https://twitter.com/KieranGilmurray
📽 YouTube: https://www.youtube.com/@KieranGilmurray


Chapters

1. Perspectives on AI Work Automation (00:00:00)

2. The Agent Company Benchmark Explained (00:02:22)

3. Simulated Work Environment Design (00:06:01)

4. Task Structure and Evaluation Metrics (00:13:41)

5. Performance Results Across AI Models (00:19:06)

6. Surprising Failure Patterns Revealed (00:22:40)

7. Implications and Future Research Directions (00:29:20)
