The robot race is fuelling a fight
for training data
AI is upending the way robots learn, leaving companies and
researchers with a need for more data. Getting it means wrestling with a host
of ethical and legal questions.
Since ChatGPT was released, we have been interacting with AI tools
more directly—and regularly—than ever before.
Interacting with robots, by contrast, is still a
rarity for most people. If you don’t undergo complex surgery or work in logistics, the
most advanced robot you encounter in your daily life might still be a vacuum
cleaner (if you’re feeling young, the first Roomba was released 22 years ago).
But that’s on the cusp of changing. Roboticists believe that
by using new AI techniques, they will achieve something the field has pined
after for decades: more capable robots that can move freely through unfamiliar
environments and tackle challenges they’ve never seen before.
“It’s like being strapped to the front of a rocket,”
Russ Tedrake, vice president of robotics research at the Toyota Research
Institute, says of the field’s pace right now. Tedrake says he has seen plenty
of hype cycles rise and fall, but none like this one. “I’ve been in the field
for 20-some years. This is different,” he says.
But something is slowing that rocket down: lack of access to
the types of data used to train robots so they can interact more smoothly with
the physical world. It’s far harder to come by than the data used to train the
most advanced AI models like GPT—mostly text, images, and videos scraped off
the internet. Simulation programs can help robots learn how to interact with
places and objects, but the results still tend to fall prey to what’s known as
the “sim-to-real gap,” or failures that arise when robots move from the
simulation to the real world.
For now, we still need access to physical, real-world data
to train robots. That data is relatively scarce and tends to require a lot more
time, effort, and expensive equipment to collect. That scarcity is one of the
main things currently holding progress in robotics back.
As a result, leading companies and labs are in fierce
competition to find new and better ways to gather the data they need. It’s led
them down strange paths, like using robotic arms to flip pancakes for hours on
end, watching thousands of hours of graphic surgery videos pulled from YouTube,
or deploying researchers to numerous Airbnbs in order to film every nook and
cranny. Along the way, they’re running into the same sorts of privacy, ethics,
and copyright issues as their counterparts in the world of chatbots.
The new need for data
For decades, robots were trained on specific tasks, like
picking up a tennis ball or doing a somersault. While humans learn about the
physical world through observation and trial and error, many robots were
learning through equations and code. This method was slow, but even worse, it
meant that robots couldn’t transfer skills from one task to a new one.
But now, AI advances are fast-tracking a shift that had
already begun: letting robots teach themselves through data. Just as a language
model can learn from a library’s worth of novels, robot models can be shown a
few hundred demonstrations of a person washing ketchup off a
plate using robotic grippers, for example, and then imitate the task
without being taught explicitly what ketchup looks like or how to turn on the
faucet. This approach is bringing faster progress and machines with much more
general capabilities.
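The demonstration-driven approach described above is often called behavior cloning: a model is fit to pairs of observations and the actions a human demonstrator took. As a toy illustration only—not any lab’s actual pipeline, with every number and variable invented for the sketch—a simple linear policy can be fit to synthetic "demonstration" data like this:

```python
import numpy as np

# Toy behavior cloning: fit a linear policy that maps observed
# states (stand-ins for, say, gripper positions) to the actions
# a human teleoperator demonstrated in those states.

rng = np.random.default_rng(0)

# 300 synthetic demonstrations: 4-D state vectors and 2-D actions,
# generated from a hidden "true" policy plus a little noise.
states = rng.normal(size=(300, 4))
true_policy = np.array([[1.0, 0.0],
                        [0.0, 1.0],
                        [0.5, -0.5],
                        [0.0, 0.2]])
actions = states @ true_policy + rng.normal(scale=0.01, size=(300, 2))

# "Training" here is just a least-squares fit of actions to states.
learned_policy, *_ = np.linalg.lstsq(states, actions, rcond=None)

# The policy can now produce an action for a state it never saw.
new_state = rng.normal(size=4)
predicted_action = new_state @ learned_policy
```

Real robot models replace the linear fit with large neural networks and raw camera input, but the core recipe—imitate recorded demonstrations rather than hand-code the task—is the same.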
Now every leading company and lab is trying to enable robots
to reason their way through new tasks using AI. Whether they succeed will hinge
on whether researchers can find enough diverse types of data to fine-tune
models for robots, as well as novel ways to use reinforcement learning to let
them know when they’re right and when they’re wrong.
“A lot of people are scrambling to figure out what’s the
next big data source,” says Pras Velagapudi, chief technology officer of
Agility Robotics, which makes a humanoid robot that operates in warehouses for
customers including Amazon. The answers to Velagapudi’s question will help
define what tomorrow’s machines will excel at, and what roles they may fill in
our homes and workplaces.
Prime training data
To understand how roboticists are shopping for data, picture
a butcher shop. There are prime, expensive cuts ready to be cooked. There are
the humble, everyday staples. And then there’s the case of trimmings and
off-cuts lurking in the back, requiring a creative chef to make them into
something delicious. They’re all usable, but they’re not all equal.
For a taste of what prime data looks like for robots,
consider the methods adopted
by the Toyota Research Institute (TRI). Amid a sprawling laboratory in
Cambridge, Massachusetts, equipped with robotic arms, computers, and a random
assortment of everyday objects like dustpans and egg whisks, researchers teach
robots new tasks through teleoperation, creating what’s called demonstration
data. A human might use a robotic arm to flip a pancake 300 times in an
afternoon, for example.
The model processes that data overnight, and then often the
robot can perform the task autonomously the next morning, TRI says. Since the
demonstrations show many iterations of the same task, teleoperation creates
rich, precisely labeled data that helps robots perform well in new tasks.
The trouble is, creating such data takes ages, and it’s also
limited by the number of expensive robots you can afford. To create quality
training data more cheaply and efficiently, Shuran Song, head of the Robotics
and Embodied AI Lab at Stanford University, designed a device that can be
wielded more nimbly by hand and built at a fraction of the cost.
Essentially a lightweight plastic gripper,
it can collect data while you use it for everyday activities like cracking an
egg or setting the table. The data can then be used to train robots to mimic
those tasks. Using simpler devices like this could fast-track the data
collection process.
Open-source efforts
Roboticists have recently alighted upon another method for
getting more teleoperation data: sharing what they’ve collected with each
other, thus saving them the laborious process of creating data sets
alone.
The Distributed Robot Interaction Dataset (DROID), published last month, was
created by researchers at 13 institutions, including companies like Google
DeepMind and top universities like Stanford and Carnegie Mellon. It contains
350 hours of data generated by humans doing tasks ranging from closing a waffle
maker to cleaning up a desk. Since the data was collected using hardware that’s
common in the robotics world, researchers can use it to create AI models and
then test those models on equipment they already have.
The effort builds on the success of the Open X-Embodiment
Collaboration, a similar project from
Google DeepMind that aggregated data on 527 skills, collected from a variety of
different types of hardware. The data set helped build Google DeepMind’s RT-X
model, which can turn text instructions (for example, “Move the apple to the
left of the soda can”) into physical movements.
Robotics models built on open-source data like this can be
impressive, says Lerrel Pinto, a researcher who runs the General-purpose
Robotics and AI Lab at New York University. But they can’t perform across a
wide enough range of use cases to compete with proprietary models built by
leading private companies. What is available via open source is simply not
enough for labs to successfully build models at a scale that would produce the
gold standard: robots that have general capabilities and can receive
instructions through text, image, and video.
“The biggest limitation is the data,” he says. Only wealthy
companies have enough.
These companies’ data advantage is only getting more
thoroughly cemented over time. In their pursuit of more training data, private
robotics companies with large customer bases have a not-so-secret weapon: their
robots themselves are perpetual data-collecting machines.
Covariant, a robotics company founded in 2017 by OpenAI
researchers, deploys robots trained to identify and pick items in warehouses
for companies like Crate & Barrel and Bonprix. These machines constantly
collect footage, which is then sent back to Covariant. Every time the robot
fails to pick up a bottle of shampoo, for example, it becomes a data point to
learn from, and the model improves its shampoo-picking abilities for next time.
The result is a massive, proprietary data set collected by the company’s own
machines.
This data set is part of why earlier this year Covariant was
able to release a powerful foundation
model, as AI models capable of a variety of uses are known. Customers can
now communicate with its commercial robots much as you’d converse with a
chatbot: you can ask questions, show photos, and instruct it to take a video of
itself moving an item from one crate to another. These customer interactions
with the model, which is called RFM-1, then produce even more data to help it
improve.
Peter Chen, cofounder and CEO of Covariant, says exposing
the robots to a number of different objects and environments is crucial to the
model’s success. “We have robots handling apparel, pharmaceuticals, cosmetics,
and fresh groceries,” he says. “It’s one of the unique strengths behind our
data set.” Up next will be bringing its fleet into more sectors and even having
the AI model power different types of robots, like humanoids, Chen says.
Learning from video
The scarcity of high-quality teleoperation and real-world
data has led some roboticists to propose bypassing that collection method
altogether. What if robots could just learn from videos of people?
Such video data is easier to produce, but unlike
teleoperation data, it lacks “kinematic” data points, which plot the exact
movements of a robotic arm as it moves through space.
Researchers from the University of Washington and Nvidia
have created a workaround, building a mobile app that lets people train robots
using augmented reality. Users take
videos of themselves completing simple tasks with their hands, like picking up
a mug, and the AR program can translate the results into waypoints for the
robotics software to learn from.
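In essence, the app turns a dense recorded hand trajectory into a short list of waypoints a robot can follow. A minimal sketch of that idea—the distance-based downsampling rule and all names here are assumptions for illustration, not the researchers’ actual method:

```python
import numpy as np

def to_waypoints(trajectory, min_dist=0.05):
    """Downsample an (N, 3) array of hand positions (meters) into
    waypoints spaced at least `min_dist` apart along the recording."""
    waypoints = [trajectory[0]]
    for point in trajectory[1:]:
        if np.linalg.norm(point - waypoints[-1]) >= min_dist:
            waypoints.append(point)
    return np.array(waypoints)

# A fake straight-line "reach for a mug" motion: 100 samples over 0.3 m.
t = np.linspace(0.0, 0.3, 100)
trajectory = np.stack([t, np.zeros_like(t), np.zeros_like(t)], axis=1)

waypoints = to_waypoints(trajectory, min_dist=0.05)
```

A hundred raw samples collapse into a handful of waypoints, which is the kind of compact motion target robotics software can plan toward.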
Meta AI is pursuing a similar collection method on a larger
scale through its Ego4D project,
a data set of more than 3,700 hours of video taken by people around the world
doing everything from laying bricks to playing basketball to kneading bread
dough. The data set is broken down by task and contains thousands of
annotations, which detail what’s happening in each scene, like when a weed has
been removed from a garden or a piece of wood is fully sanded.
Learning from video data means that robots can encounter a
much wider variety of tasks than they could if they relied solely on human
teleoperation (imagine folding croissant dough with robot arms). That’s
important, because just as powerful language models need complex and diverse
data to learn, roboticists can create their own powerful models only if they
expose robots to thousands of tasks.
To that end, some researchers are trying to wring useful
insights from an abundant but low-quality data source: YouTube. With
thousands of hours of video uploaded every minute, there is no shortage of
available content. The trouble is that most of it is pretty useless for a
robot. That’s because it’s not labeled with the types of information robots
need, like annotations or kinematic data.
“You can say [to a
robot], Oh, this is a person playing Frisbee with their dog,” says Chen, of
Covariant, imagining a typical video that might be found on YouTube. “But it’s
very difficult for you to say, Well, when this person throws a Frisbee, this is
the acceleration and the rotation and that’s why it flies this way.”
Nonetheless, a few attempts have proved promising. When he
was a postdoc at Stanford, AI researcher Emmett Goodman looked into how AI
could be brought into the operating room to make surgeries safer and more
predictable. Lack of data quickly became a roadblock. In laparoscopic
surgeries, surgeons often use robotic arms to manipulate surgical tools
inserted through very small incisions in the body. Those robotic arms have
cameras capturing footage that can help train models, once personally
identifying information has been removed from the data. In more traditional
open surgeries, on the other hand, surgeons use their hands instead of robotic
arms. That produces much less data to build AI models with.
“That is the main barrier to why open-surgery AI is the
slowest to develop,” he says. “How do you actually collect that data?”
To tackle that problem, Goodman trained an AI model on
thousands of hours of open-surgery videos, taken by doctors with handheld or
overhead cameras, that his team gathered from YouTube (with identifiable
information removed). His model, as described in a paper in the medical
journal JAMA in
December 2023, could then identify segments of the operations from the videos.
This laid the groundwork for creating useful training data, though Goodman
admits that the barriers to doing so at scale, like patient privacy and
informed consent, have not been overcome.
Uncharted legal waters
Chances are that wherever roboticists turn for their new
troves of training data, they’ll at some point have to fight some major
legal battles.
The makers of large language models are already having to
navigate questions of credit and copyright. A lawsuit filed by the New
York Times alleges that ChatGPT copies the expressive style of its
stories when generating text. The chief technology officer of OpenAI recently
made headlines when she said the company’s video generation tool Sora was
trained on publicly available data, sparking a critique from YouTube’s CEO, who
said that if Sora learned from YouTube videos, it would be a violation of the
platform’s terms of service.
“It is an area where there’s a substantial amount of legal
uncertainty,” says Frank Pasquale, a professor at Cornell Law School. If
robotics companies want to join other AI companies in using copyrighted works
in their training sets, it’s unclear whether that’s allowed under the fair-use
doctrine, which permits copyrighted material to be used without permission in a
narrow set of circumstances. An example often cited by tech companies and those
sympathetic to their view is the 2015 case of Google Books, in which courts
found that Google did not violate copyright laws in making a searchable
database of millions of books. That legal precedent may tilt the scales
slightly in tech companies’ favor, Pasquale says.
It’s far too soon to tell whether legal challenges will slow
down the robotics rocket ship, since AI-related cases are sprawling and still
undecided. But it’s safe to say that roboticists scouring YouTube or other
internet video sources for training data will be wading in fairly uncharted
waters.
The next era
Not every roboticist feels that data is the missing link for
the next breakthrough. Some argue that if we build a good enough virtual world
for robots to learn in, maybe we don’t need training data from the real world
at all. Why go through the effort of training a pancake-flipping robot in a
real kitchen, for example, if it could learn through a digital simulation of a
Waffle House instead?
Roboticists have long used simulator programs, which digitally
replicate the environments that robots navigate through, often down to details
like the texture of the floorboards or the shadows cast by overhead lights. But
as powerful as they are, roboticists using these programs to train machines
have always had to work around that sim-to-real gap.
Now the gap might be shrinking. Advanced image generation
techniques and faster processing are allowing simulations to look more like the
real world. Nvidia, which leveraged its experience in video game graphics to
build the leading robotics simulator, called Isaac Sim, announced last month
that leading humanoid robotics companies like Figure and Agility are using its
program to build foundation models. These companies build virtual replicas of
their robots in the simulator and then unleash them to explore a range of new
environments and tasks.
Deepu Talla, vice president of robotics and edge computing
at Nvidia, doesn’t hold back in predicting that this way of training will
nearly replace the act of training robots in the real world. It’s simply far
cheaper, he says.
“It’s going to be a million to one, if not more, in terms of
how much stuff is going to be done in simulation,” he says. “Because we can
afford to do it.”
But even if models can solve some of the “cognitive” problems,
like learning new tasks, there are a host of challenges to realizing that
success in an effective and safe physical form, says Aaron Saunders, chief
technology officer of Boston Dynamics. We’re a long way from building hardware
that can sense different types of materials, scrub and clean, or apply a gentle
amount of force.
“There’s still a massive piece of the equation around how
we’re going to program robots to actually act on all that information to
interact with that world,” he says.
If we solved that problem, what would the robotic future
look like? We could see nimble robots that help people with physical
disabilities move through their homes,
autonomous drones that clean up pollution or hazardous waste, or surgical
robots that make microscopic incisions, leading to operations with a reduced
risk of complications. For all these optimistic visions, though, more
controversial ones are already brewing. The use of AI by
militaries worldwide is on the rise, and the emergence of autonomous
weapons raises troubling questions.
The labs and companies poised to lead in the race for data
include, at the moment, the humanoid-robot startups beloved by investors
(Figure AI was recently boosted by
a $675 million funding round), commercial companies with sizable fleets of
robots collecting data, and drone companies buoyed by significant military
investment. Meanwhile, smaller academic labs are doing more with less to
create data sets that rival those available to Big Tech.
But what’s clear to everyone I speak with is that we’re at
the very beginning of the robot data race. Since the correct way forward is far
from obvious, all roboticists worth their salt are pursuing any and all methods
to see what sticks.
There “isn’t really a consensus” in the field, says Benjamin
Burchfiel, a senior research scientist in robotics at TRI. “And that’s a
healthy place to be.”