Anthropic’s Claude 3.5 Sonnet large language model has gained a new ability: operating a computer.
The new ability, which the company calls “computer use,” is currently in beta. It enables developers to instruct Claude 3.5 Sonnet, through the Anthropic API, to read and interpret what’s on the display, type text, move the cursor, click buttons, and switch between windows or applications, much as today’s robotic process automation (RPA) tools can be instructed, far more laboriously, to do.
To apply its ability to use a computer, Claude 3.5 Sonnet starts from a prompt defining its goal, identifies the steps necessary to reach that goal, and then examines screenshots, much as a human would look at the screen, to figure out how to perform those steps.
Key to this is Claude 3.5 Sonnet’s newfound ability to return the pixel coordinates of a feature in an image, enabling it to position the cursor on a button or in a text box on screen.
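Those returned coordinates become cursor commands for whatever software drives the desktop. A minimal sketch of that translation step follows; the action dictionary shape here is illustrative and is not guaranteed to match Anthropic’s exact tool schema:

```python
# Illustrative only: the action dicts below approximate, but may not match,
# the shape of actions a computer-use model returns.

def action_to_command(action: dict) -> str:
    """Translate a model-returned action into a desktop-driver command string."""
    kind = action["action"]
    if kind == "mouse_move":
        x, y = action["coordinate"]  # pixel position the model located on screen
        return f"move cursor to ({x}, {y})"
    if kind == "left_click":
        return "click left button"
    if kind == "type":
        return f"type {action['text']!r}"
    raise ValueError(f"unsupported action: {kind}")

print(action_to_command({"action": "mouse_move", "coordinate": [640, 360]}))
# prints: move cursor to (640, 360)
```

A real integration would hand these commands to a GUI-automation layer rather than printing them.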
Claude 3.5 Sonnet needs definitions of the tools and software on the computer it will operate, and authorization to access them. It then sends requests to use the tools and examines each response to see whether it has succeeded or needs to continue using the tool to complete its task.
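That request-and-examine cycle can be sketched as a simple loop. In the stub below, `run_agent`, `fake_model`, and `fake_run_tool` are hypothetical names for illustration; a real integration would call the Anthropic API for the model turn and a real desktop driver for the tool turn:

```python
# Hedged sketch of the tool-use loop: the model and desktop are stubbed out.

def run_agent(model, run_tool, goal, max_turns=10):
    """Send the goal to the model; while it requests a tool, execute the
    request and feed the result back; stop when the model replies with
    plain text or the turn budget runs out."""
    messages = [{"role": "user", "content": goal}]
    for _ in range(max_turns):
        reply = model(messages)
        if reply["type"] != "tool_use":
            return reply["text"]  # model considers the task complete
        result = run_tool(reply["name"], reply["input"])
        messages.append({"role": "assistant", "content": reply})
        messages.append({"role": "user",
                         "content": {"type": "tool_result", "content": result}})
    return None  # turn budget exhausted without completion


# Toy stand-ins: the model asks for one screenshot, then finishes.
def fake_model(messages):
    if len(messages) == 1:
        return {"type": "tool_use", "name": "computer",
                "input": {"action": "screenshot"}}
    return {"type": "text", "text": "done"}

def fake_run_tool(name, tool_input):
    return f"executed {tool_input['action']}"

print(run_agent(fake_model, fake_run_tool, "open the browser"))
# prints: done
```

The `max_turns` cap reflects a practical safeguard: an agent that keeps failing a step should eventually stop rather than loop indefinitely.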