Will Generative AI Transform Robotics?

In the current wave of excitement about applying large vision–language models and generative AI to robotics, expectations are running high, but robots still struggle to master the complexities of the real world.


In an influential essay from 2019, entitled ‘The Bitter Lesson’, the machine learning researcher Richard Sutton observed that the main driver of progress in artificial intelligence (AI) has been the continued scaling up of computational power [1]. This view predicts that while manual approaches that embed human knowledge and understanding in AI agents lead to satisfying advances in the short term, in the long run they only stand in the way of developing more general, scalable methods. The provocative conclusion has sparked heated debates about the role of human ingenuity, but the ‘bitter lesson’ paradigm has more or less played out in natural language processing. By training scaled-up neural networks on as much text from the internet as possible, researchers have largely solved the once formidable problem of producing fluent, grammatically correct human language. Further scaling has produced general-purpose and multimodal models with billions of parameters, such as GPT-4, Claude, Gemini and Llama, that have game-changing applications in science and society.

Perhaps the time has arrived for robotics to learn its own bitter lesson and to benefit from substantially scaled-up models and large amounts of training data. At the recent annual International Conference on Robotics and Automation (ICRA), several experts debated the statement “Generative AI will make a lot of traditional robotics approaches obsolete.” The field could certainly do with new ideas: after decades of painstaking computational development and engineering, robotics methods for perception, motion planning, reasoning, grasping, manipulation and human–robot interaction still leave robots far from able to navigate the complex and unpredictable human world. Indeed, deep learning methods are starting to compete with traditional approaches in robot control and sensor data processing. The promise of large generative AI models that, given sufficient training data, can generalize to different tasks and situations is tantalizing.

However, gathering training data for robots is costly and slow. In the ICRA panel, Jeannette Bohg from Stanford University gave a back-of-the-envelope estimate that, to match the amount of data available for natural language processing from the streams of images and text produced by internet users, robotics training data would need to scale up by a factor of roughly 27 million. This sounds daunting, but Bohg pointed out that there is no fundamental obstacle to achieving this goal: researchers can rise to the challenge and put substantial effort into gathering good-quality robotics data. Notably, a recent community effort named ‘Open X-Embodiment’ has produced a dataset spanning 22 robots, 527 skills and 160,266 tasks, which seems a sizeable start.
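
As a rough illustration of how such a back-of-the-envelope estimate can be framed, the sketch below compares the size of a web-scale text and image corpus with the number of robot demonstrations available today. The figures are purely illustrative assumptions chosen for the sake of the example; they are not the numbers used in the panel.

```python
# Illustrative back-of-the-envelope comparison of data scales.
# Both quantities are assumed, round figures chosen only to show the shape of
# the argument; they are not the numbers quoted at ICRA.

web_scale_data_points = 4.5e12   # assumed data points in a web-scale text/image corpus
robot_demonstrations = 1.65e5    # assumed robot demonstration episodes collected to date

scaling_factor = web_scale_data_points / robot_demonstrations
print(f"Robotics data would need to grow by roughly {scaling_factor:,.0f}x")
# With these assumed figures the gap lands in the tens of millions,
# the same order of magnitude as the factor quoted in the panel.
```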

However, the feasibility of ever gathering sufficient data to develop a general-purpose robotics model is questionable. The complexity of real-world interactions is enormous, and high standards of reliability and robustness are required. A zero-shot success rate of 50 or even 75 percent is an impressive achievement in a laboratory setting, but unacceptable in real-world interactions. In the debate, Chad Jenkins from the University of Michigan highlighted the problem of reliability and trust: can we be sure that a general-purpose robotics model is really going to work when we need it to? It might not be disastrous if a chatbot hallucinates an answer, but machines operating in the real world and interacting with humans need to be safe and reliable. In Jenkins’ view, robotics will always need to turn to models based on a physical understanding of the world.

Elsewhere at ICRA, researchers are already exploring the feasibility of using large vision–language models for their robots. Initial results show a promising jump in capabilities and robustness in scene understanding, human–robot interaction, and even action planning. Large vision–language models such as GPT-4 and Gemini have absorbed internet-scale amounts of data from human users, and can arguably replicate a type of ‘common sense’ practical knowledge of the world that could be put to use in robotics. It is also clear that this common-sense knowledge comes with substantial reliability issues and falls well short of human-like understanding. Even so, the semantic knowledge of everyday concepts that comes naturally to large vision–language models could already be harnessed for scene understanding and for interactions with humans, as sketched below.
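
A minimal sketch of how such common-sense scene understanding might be plugged into a robot’s pipeline is shown below. The `query_vlm` callable is a hypothetical stand-in for whatever vision–language model a lab uses; the point is the pattern of sending an image plus a natural-language prompt and parsing a structured answer, with a guard for the unreliable outputs discussed above.

```python
# Minimal sketch of using a vision-language model (VLM) for robot scene
# understanding. The VLM client is a hypothetical stand-in for whatever hosted
# or local model a lab uses; the pattern is image + prompt in, structured
# answer out.

import json
from typing import Callable

# A VLM client here is just a function: (image bytes, text prompt) -> raw text answer.
VLMClient = Callable[[bytes, str], str]


def describe_scene(image: bytes, query_vlm: VLMClient) -> dict:
    """Ask the VLM for a machine-readable description of the objects in view."""
    prompt = (
        "List the objects visible in this image as a JSON array of "
        '{"name": ..., "graspable": true/false, "location": ...} entries. '
        "Answer with JSON only."
    )
    raw_answer = query_vlm(image, prompt)
    try:
        return {"objects": json.loads(raw_answer)}
    except json.JSONDecodeError:
        # VLM outputs are not guaranteed to be well formed -- a small reminder
        # of the reliability issues discussed above.
        return {"objects": [], "raw_answer": raw_answer}
```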

But the complex problems that come with acting in a dynamically changing world remain. How a robot can physically interact with its environment depends on its body, which determines the actions available to it (its affordances). A next step in this direction is highlighted by the ‘SayCan’ project at Google Research, in which the PaLM model is grounded in the affordances of real-world mobile robots. A related research direction is to develop vision–language models with a more advanced, physical common-sense understanding of the world. An essential ingredient here is the curated collection of video examples that can teach models about the physical properties of objects and the effects of manipulating them.
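
The idea behind this kind of affordance grounding can be sketched in a few lines. The sketch below is an illustrative simplification with hypothetical scoring functions, not the actual SayCan implementation: for each candidate skill, a language model scores how useful the skill would be for the instruction, a learned value function scores how likely the skill is to succeed from the robot’s current state, and the two scores are combined to rank the candidates.

```python
# Sketch of SayCan-style affordance grounding (an illustrative simplification,
# not the Google Research implementation). The language model judges how useful
# each skill is for the instruction; a learned affordance (value) function
# judges how feasible the skill is from the current state; their product ranks
# the candidate skills.

from typing import Callable, Sequence


def select_skill(
    instruction: str,
    state: object,                                      # current robot observation
    skills: Sequence[str],                              # e.g. "pick up the sponge"
    llm_usefulness: Callable[[str, str], float],        # score(instruction, skill)
    affordance_value: Callable[[object, str], float],   # score(state, skill)
) -> str:
    """Return the candidate skill with the highest usefulness x feasibility score."""
    scores = {
        skill: llm_usefulness(instruction, skill) * affordance_value(state, skill)
        for skill in skills
    }
    return max(scores, key=scores.get)
```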

There is substantial momentum in robotics, with big tech backing start-ups and initiatives [6], and robots will no doubt become more prominent in society, given improvements in hardware, computational efficiency and the current wave of progress in AI. Designing robots that can safely and reliably operate in the real world remains a challenging problem, but large vision–language models and generative AI are injecting the field with fresh ideas.

This article was written by the Nature Machine Intelligence editorial team.