AI hypotheses lag human ones when put to the test

Machines still face hurdles in identifying fresh research paths, study suggests

In May, scientists at Future House, a San Francisco–based nonprofit startup, announced they had identified a potential drug to treat vision loss. Yet they couldn’t fully claim the discovery themselves. Many steps in the scientific process—from literature search to hypothesis generation to data analysis—had been conducted by an artificial intelligence (AI) the team had built.

All over the world, from computer science to chemistry, AI is speeding up the scientific enterprise—in part by automating something that once seemed a uniquely human creation: the production of hypotheses. In a heartbeat, machines can now scour the ballooning research literature for gaps, signaling fruitful research avenues that scientists might otherwise miss.

But how good are the ideas? A new study, one of the largest of its kind, finds that AI-generated hypotheses still fall short of human ones when researchers put them through real-world tests and ask human evaluators to compare the results. But not by much. And maybe not for long.

A paper describing the experiment, posted to the arXiv preprint server in June, suggests AI systems can sometimes embellish hypotheses, exaggerating their potential importance. The study also suggests AI is not as good as humans at judging the feasibility of testing the ideas it conjures up, says Chenglei Si, a Ph.D. student in computer science at Stanford University and lead author of the study.

The research is drawing praise but also caution from others in the field, in part because judging originality is so difficult. “Novelty is the bugbear of scientific evaluation and one of the most difficult tasks in peer review,” says Jevin West, a data scientist at the University of Washington.

The study examined hypotheses about AI itself, in particular natural language processing (NLP), which underpins AI tools called large language models (LLMs). The researchers tasked Claude 3.5 Sonnet, an LLM developed by the startup Anthropic, with generating thousands of ideas based on an analysis of NLP studies in the Semantic Scholar database and ranking the most original ones. The researchers then paid human NLP specialists to come up with competing ideas.

The team also recruited another group of computer scientists to judge the novelty, feasibility, and other qualities of the two sets of ideas, without being told which ideas came from the AI and which from the humans. These reviewers gave the AI ideas higher marks on average, a surprising finding the team reported in a 2024 preprint that drew media attention.

But the tables turned in the study’s second phase. After advertising via social media and other routes—including on a T-shirt Si wore to conferences—the team recruited a new group of paid NLP specialists to run experiments for 24 of the AI-generated ideas and 19 human ones. The tests typically examined how a proposed algorithm could improve an aspect of an LLM, such as its language translations, and the experimenters were empowered to tweak the study design by, for example, choosing a data set better suited to evaluate the hypothesis.

The team once again got independent evaluators to judge the hypotheses. Average overall scores for the AI ideas dipped from 5.382 to 3.406 on a 10-point scale, whereas human ideas fell from 4.596 to only 3.968. Si says the results show the importance of putting hypotheses to the test.
“If you only look at the ideas, some reviewers can get fooled by how exciting certain words sound, but when you actually look at that code execution or interpretation of it, you’ll realize it’s just a fancy or novel phrasing of a known technique.” (That concern was echoed in a February study of 50 AI hypotheses: Human evaluators judged one-third to have been plagiarized, with another third partially borrowed from previous work. Only two were mostly novel, and none were completely novel.)

The study is “really exciting” but has limitations, says Dan Weld, chief scientist at the nonprofit Allen Institute for Artificial Intelligence. For one, he says, it relied on a single LLM to generate hypotheses based on a wide body of relevant research rather than using multiple AI tools to scour highly cited studies written by prominent specialists. Humans are not necessarily the best judges of novelty, either: Previous studies have found that researchers disagree substantially in how they score the same computer science papers. An experiment’s novelty is best evaluated in hindsight, after years of accruing citations, West adds.

Si says it would be too time-consuming to have humans ground-truth AI-generated hypotheses as a matter of course. But LLMs could get better at recognizing novel hypotheses if they are trained on the details of past, successful experiments, he suggests.

Despite the questions, the AI and human scores were remarkably close—something that would have shocked researchers even a few years ago. Weld wouldn’t be surprised if, eventually, AI comes up with most hypotheses and scientists are left to carry out the parts of the experiments that can’t be automated with robots. But if that’s the case, it removes “the most fun part of science” and leaves scientists to conduct lab work that “is sometimes brain numbing,” West says. “Science is a social process that involves humans. You take that out, then what is it?”