AI hypotheses lag human ones when put to the test
Machines still face hurdles in identifying fresh research paths, study suggests
In May,
scientists at Future House, a San
Francisco–based nonprofit startup, announced they had identified a potential
drug to treat vision loss. Yet they couldn’t fully claim the discovery
themselves. Many steps in the scientific process—from literature search to
hypothesis generation to data analysis—had been conducted by an artificial
intelligence (AI) the team had built. All over the world, from computer science
to chemistry, AI is speeding up the scientific enterprise—in part by automating
something that once seemed a uniquely human creation, the production of
hypotheses. In a heartbeat, machines can now scour the ballooning research
literature for gaps, signaling fruitful research avenues that scientists might
otherwise miss. But how good are the ideas? A new study, one of the largest of
its kind, finds the AI-generated hypotheses still fall short of human ones,
when researchers put them through real-world tests and get human evaluators to
compare the results. But not by much. And maybe not for long. A paper
describing the experiment, posted to the arXiv preprint server in June,
suggests AI systems can sometimes embellish hypotheses, exaggerating their
potential importance. The study also suggests AI is not as good as humans at
judging the feasibility of testing the ideas it conjures up, says Chenglei Si,
a Ph.D. student in computer science at Stanford University and lead author of
the study. The research is drawing praise but also caution from others in the
field, in part because judging originality is so difficult. “Novelty is the
bugbear of scientific evaluation and one of the most difficult tasks in peer
review,” says Jevin West, a data scientist at the University of Washington. The
study examined hypotheses about AI itself, in particular natural language
processing (NLP), which underpins AI tools called large language models (LLMs).
The researchers tasked Claude 3.5 Sonnet, an LLM developed by the startup
Anthropic, with generating thousands of ideas based on an analysis of NLP
studies in the Semantic Scholar database and ranking the most original ones.
The researchers then paid human NLP specialists to come up with competing
ideas. The team also recruited another group of computer scientists to judge
the novelty, feasibility, and other qualities of the two sets of ideas, without knowing which were machine-generated. They gave the AI ideas higher marks on
average, a surprising finding the team reported in a 2024 preprint that drew
media attention. But the tables turned in the study’s second phase. After
advertising via social media and other routes—including on a T-shirt Si wore to conferences—the team recruited a new group of paid NLP specialists to run
experiments for 24 of the AI-generated ideas and 19 human ones. The tests
typically examined how a proposed algorithm could improve an aspect of an LLM,
such as its language translation, and the experimenters were empowered to
tweak the study design by, for example, choosing a data set better suited to
evaluate the hypothesis. The team once again got independent evaluators to
judge the hypotheses. On a 10-point scale, average overall scores for the AI ideas dipped from 5.382 to 3.406, whereas human ideas fell from 4.596 to only
3.968. Si says the results show the importance of putting hypotheses to the
test. “If you only look at the ideas, some reviewers can get fooled by how
exciting certain words sound, but when you actually look at that code execution
or interpretation of it, you’ll realize it’s just a fancy or novel phrasing of
a known technique.” (That concern was echoed in a February study of 50 AI
hypotheses: Human evaluators judged one-third to have been plagiarized, with
another third partially borrowed from previous work. Only two were mostly novel
and none were completely novel.) The study is “really exciting” but has
limitations, says Dan Weld, chief scientist at the nonprofit Allen Institute
for Artificial Intelligence. For one, he says, it relied on a single LLM to
generate hypotheses based on a wide body of relevant research rather than using
multiple AI tools to scour highly cited studies written by prominent
specialists. Nor are humans necessarily the best judges of novelty: Previous studies have found that actual researchers disagree
substantially in how they score the same computer science papers. An
experiment’s novelty is best evaluated in hindsight, after years of accruing
citations, West adds. Si says it would be too time-consuming to have humans ground-truth AI-generated hypotheses as a matter of course. But LLMs could get better
at recognizing novel hypotheses if they are trained on the details of past,
successful experiments, he suggests. Despite the questions, the AI and human
scores were remarkably close—something that would have shocked researchers even
a few years ago. Weld wouldn’t be surprised if, eventually, AI comes up with
most hypotheses and scientists are left to carry out the parts of the experiments
that can’t be automated with robots. But if that’s the case, it removes “the
most fun part of science” and leaves scientists to conduct lab work that “is
sometimes brain numbing,” West says. “Science is a social process that involves
humans. You take that out, then what is it?”