AI agents leave much to be desired

April 28, 2025

"Carnegie Mellon researchers instructed artificial intelligence models from Google, OpenAI, Anthropic, and Meta to complete tasks a real employee might carry out in fields such as finance, administration, and software engineering.

"In one, the AI had to navigate through several files to analyze a coffee shop chain's databases. In another, it was asked to collect feedback on a 36-year-old engineer and write a performance review. Some tasks challenged the models' visual capabilities: One required the models to watch video tours of prospective new office spaces and pick the one with the best health facilities.

"The results weren't great: The top-performing model, Anthropic's Claude 3.5 Sonnet, finished a little less than one-quarter of all tasks. The rest, including Google's Gemini 2.0 Flash and the one that powers ChatGPT, completed about 10% of the assignments.

"The findings, along with other emerging research about AI agents, complicate the idea that an AI agent workforce is just around the corner —there's a lot of work they simply aren't good at.

"In multiple [other] studies, AI agents attempted to deceive and hack to accomplish their goals. In some tests with TheAgentCompany, when an agent was confused about the next steps, it created nonexistent shortcuts. During one task, an agent couldn't find the right person to speak with on the chat tool and decided to create a user with the same name, instead."
