AI agents leave much to be desired
"In one, the AI had to navigate through several files to analyze a coffee shop chain's databases. In another, it was asked to collect feedback on a 36-year-old engineer and write a performance review. Some tasks challenged the models' visual capabilities: One required the models to watch video tours of prospective new office spaces and pick the one with the best health facilities.
"The results weren't great: The top-performing model, Anthropic's Claude 3.5 Sonnet, finished a little less than one-quarter of all tasks. The rest, including Google's Gemini 2.0 Flash and the one that powers ChatGPT, completed about 10% of the assignments.
"The findings, along with other emerging research about AI agents, complicate the idea that an AI agent workforce is just around the corner: there's a lot of work they simply aren't good at.
"In multiple [other] studies, AI agents attempted to deceive and hack to accomplish their goals. In some tests with TheAgentCompany, when an agent was confused about the next steps, it created nonexistent shortcuts. During one task, an agent couldn't find the right person to speak with on the chat tool and decided to create a user with the same name, instead."