This new tool helps you figure out which chatbot answers to trust.
In many high-stakes situations, large language models are
not worth the risk. Knowing which outputs to throw out might fix that.
Large language models are famous for their ability to make
things up—in fact, it’s what they’re best at. But their inability to tell fact
from fiction has left many businesses wondering if using them is worth the
risk.
A new tool created by Cleanlab, an AI startup spun out of a
quantum computing lab at MIT, is designed to give high-stakes users a clearer
sense of how trustworthy these models really are. Called the Trustworthy
Language Model, it gives any output generated by a large language model a
score between 0 and 1, according to its reliability. This lets people choose
which responses to trust and which to throw out. In other words: a BS-o-meter
for chatbots.
Cleanlab hopes that its tool will make large language models
more attractive to businesses worried about how much stuff they invent. “I
think people know LLMs will change the world, but they’ve just got hung up on
the damn hallucinations,” says Cleanlab CEO Curtis Northcutt.
Chatbots are quickly becoming the dominant way people look
up information on a computer. Search
engines are being redesigned around the technology. Office software
used by billions of people every day to create everything from school
assignments to marketing copy to financial reports now comes
with chatbots built in. And yet a study put out in November by Vectara, a
startup founded by former Google employees, found that chatbots
invent information at least 3% of the time. It might not sound like much,
but it’s a potential for error most businesses won’t stomach.
Cleanlab’s tool is already being used by a handful of
companies, including Berkeley Research Group, a UK-based consultancy
specializing in corporate disputes and investigations. Steven Gawthorpe,
associate director at Berkeley Research Group, says the Trustworthy Language
Model is the first viable solution to the hallucination problem that he has
seen: “Cleanlab’s TLM gives us the power of thousands of data scientists.”
In 2021, Cleanlab developed technology that discovered
errors in 10 popular data sets used to train machine-learning
algorithms; it works by measuring the differences in output across a range of
models trained on that data. That tech is now used by several large companies,
including Google, Tesla, and the banking giant Chase. The Trustworthy Language
Model takes the same basic idea—that disagreements between models can be used
to measure the trustworthiness of the overall system—and applies it to
chatbots.
In a demo Cleanlab gave to MIT Technology Review last
week, Northcutt typed a simple question into ChatGPT: “How many times does the
letter ‘n’ appear in ‘enter’?” ChatGPT answered: “The letter ‘n’ appears once
in the word ‘enter.’” That correct answer promotes trust. But ask the question
a few more times and ChatGPT answers: “The letter ‘n’ appears twice in the word
‘enter.’”
“Not only does it often get it wrong, but it’s also random,
you never know what it’s going to output,” says Northcutt. “Why the hell can’t
it just tell you that it outputs different answers all the time?”
Cleanlab’s aim is to make that randomness more explicit.
Northcutt asks the Trustworthy Language Model the same question. “The letter
‘n’ appears once in the word ‘enter,’” it says—and scores its answer 0.63. Six
out of 10 is not a great score, suggesting that the chatbot’s answer to this
question should not be trusted.
It’s a basic example, but it makes the point. Without the
score, you might think the chatbot knew what it was talking about, says
Northcutt. The problem is that data scientists testing large language models in
high-risk situations could be misled by a few correct answers and assume that
future answers will be correct too: “They try things out, they try a few
examples, and they think this works. And then they do things that result in
really bad business decisions.”
The Trustworthy Language Model draws on multiple techniques
to calculate its scores. First, each query submitted to the tool is sent to one
or more large language models. The tech will work with any model, says
Northcutt, including closed-source models like OpenAI’s GPT series, the models
behind ChatGPT, and open-source models like DBRX, developed by San
Francisco-based AI firm Databricks. If the responses from each of these models
are the same or similar, it will contribute to a higher score.
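To make the idea concrete, here is a minimal sketch of agreement scoring, not Cleanlab's actual code: the `ask_models` stub stands in for real calls to several LLM APIs, and plain text similarity stands in for whatever comparison the real system uses.

```python
# A minimal sketch of scoring by cross-model agreement. The responses
# below are hard-coded stand-ins for real calls to several LLM APIs.
from difflib import SequenceMatcher
from itertools import combinations

def ask_models(query: str) -> list[str]:
    # Hypothetical stub: in practice, send `query` to GPT, DBRX, etc.
    return [
        "The letter 'n' appears once in the word 'enter'.",
        "The letter 'n' appears once in 'enter'.",
        "The letter 'n' appears twice in the word 'enter'.",
    ]

def agreement_score(responses: list[str]) -> float:
    """Mean pairwise text similarity: 1.0 means all models agree."""
    sims = [SequenceMatcher(None, a, b).ratio()
            for a, b in combinations(responses, 2)]
    return sum(sims) / len(sims)

responses = ask_models("How many times does the letter 'n' appear in 'enter'?")
print(round(agreement_score(responses), 2))  # disagreement pulls this below 1.0
```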
At the same time, the Trustworthy Language Model also sends
variations of the original query to each of the models, swapping in words that
have the same meaning. Again, if the responses to synonymous queries are
similar, it will contribute to a higher score. “We mess with them in different
ways to get different outputs and see if they agree,” says Northcutt.
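That consistency check can be sketched in a few lines. Everything here is illustrative: the hard-coded paraphrases stand in for machine-generated rewordings, and `ask` for a call to any of the underlying models.

```python
# Sketch of a paraphrase-consistency check; not Cleanlab's API.
def paraphrases(query: str) -> list[str]:
    # Hypothetical: a real system would have an LLM reword the query.
    return [
        "How many times does 'n' occur in 'enter'?",
        "Count the letter 'n' in the word 'enter'.",
    ]

def consistency_score(query: str, ask) -> float:
    """Fraction of answers (to the original and reworded queries)
    that match the most common answer. 1.0 means fully consistent."""
    answers = [ask(q) for q in [query, *paraphrases(query)]]
    top = max(set(answers), key=answers.count)
    return answers.count(top) / len(answers)
```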
The tool can also get multiple models to bounce responses
off one another: “It’s like, ‘Here’s my answer—what do you think?’ ‘Well,
here’s mine—what do you think?’ And you let them talk.” These interactions are
monitored and measured and fed into the score as well.
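That cross-examination step might look something like the sketch below, where `ask(model, prompt)` again stands in for a real API call and each model votes yes or no on its peers' answers.

```python
# Sketch of models reviewing one another's answers; nothing here
# is Cleanlab's code, and `ask` is a hypothetical model-call stub.
def cross_review(query: str, model_names: list[str], ask) -> float:
    answers = {m: ask(m, query) for m in model_names}
    votes = []
    for reviewer in model_names:
        for author, answer in answers.items():
            if author == reviewer:
                continue
            verdict = ask(reviewer, f"Question: {query}\n"
                                    f"Proposed answer: {answer}\n"
                                    "Do you agree? Answer yes or no.")
            votes.append(verdict.strip().lower().startswith("yes"))
    return sum(votes) / len(votes)  # fraction of peer endorsements
```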
Nick McKenna, a computer scientist at Microsoft Research in
Cambridge, UK, who works on large language models for code generation, is
optimistic that the approach could be useful. But he doubts it will be perfect.
“One of the pitfalls we see in model hallucinations is that they can creep in
very subtly,” he says.
In a range of tests across different large language models,
Cleanlab shows that its trustworthiness scores correlate well with the accuracy
of those models’ responses. In other words, scores close to 1 line up with
correct responses, and scores close to 0 line up with incorrect ones. In
another test, the company also found that using the Trustworthy Language Model with
GPT-4 produced more reliable responses than using GPT-4 by itself.
Large language models generate text by predicting the most
likely next word in a sequence. In future versions of its tool, Cleanlab plans
to make its scores even more accurate by drawing on the probabilities that a
model used to make those predictions. It also wants to access the numerical
values that models assign to each word in their vocabulary, which they use to
calculate those probabilities. This level of detail is provided by certain
platforms, such as Amazon’s Bedrock, that businesses can use to run large
language models.
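To give a sense of what that refinement could look like: some model APIs can return a log probability for each generated token (OpenAI's chat API, for instance, via a `logprobs` option), and one simple way to fold those into a confidence number is a geometric mean. This is a back-of-the-envelope sketch, not Cleanlab's planned method.

```python
import math

def logprob_confidence(token_logprobs: list[float]) -> float:
    """Geometric mean of per-token probabilities, in (0, 1].
    Low values flag answers the model itself was unsure of."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))

print(logprob_confidence([-0.01, -0.02, -0.05]))  # confident: ~0.97
print(logprob_confidence([-0.9, -1.5, -2.2]))     # shaky:     ~0.22
```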
Cleanlab has tested its approach on data provided by
Berkeley Research Group. The firm needed to search for references to
health-care compliance problems in tens of thousands of corporate documents.
Doing this by hand can take skilled staff weeks. By checking the documents
using the Trustworthy Language Model, Berkeley Research Group was able to see
which documents the chatbot was least confident about and check only those. It
reduced the workload by around 80%, says Northcutt.
In another test, Cleanlab worked with a large bank
(Northcutt would not name it but says it is a competitor to Goldman Sachs).
Similar to Berkeley Research Group, the bank needed to search for references to
insurance claims in around 100,000 documents. Again, the Trustworthy Language
Model reduced the number of documents that needed to be hand-checked by more
than half.
Running each query multiple times through multiple models
takes longer and costs a lot more than the typical back-and-forth with a single
chatbot. But Cleanlab is pitching the Trustworthy Language Model as a premium
service to automate high-stakes tasks that would have been off limits to large
language models in the past. The idea is not for it to replace existing
chatbots but to do the work of human experts. If the tool can slash the amount
of time that you need to employ skilled economists or lawyers at $2,000 an
hour, the costs will be worth it, says Northcutt.
In the long run, Northcutt hopes that by reducing the
uncertainty around chatbots’ responses, his tech will unlock the promise of
large language models to a wider range of users. “The hallucination thing is
not a large-language-model problem,” he says. “It’s an uncertainty problem.”
Correction: This article has been updated to clarify that
the Trustworthy Language Model works with a range of different large language
models.
Will Douglas Heaven