All generative AI models hallucinate, from Google's Gemini to Anthropic's Claude to OpenAI's latest stealth release, GPT-4o. In other words, the models are unreliable narrators, which is sometimes amusing and sometimes a real problem.
However, not all models make stuff up at the same rate. And the types of lies they spread are determined by the information sources to which they have been exposed.
A recent study by researchers at Cornell, the universities of Washington and Waterloo, and the nonprofit research institute AI2 attempted to benchmark hallucinations by fact-checking models such as GPT-4o against authoritative sources on topics ranging from law and health to history and geography. They found that no model performed exceptionally well across all topics, and that the models that hallucinated the least did so in part because they refused to answer questions they would otherwise have gotten wrong.
“The most important takeaway from our work is that we cannot yet fully trust the outputs of model generations,” Wenting Zhao, a doctoral student at Cornell and a co-author on the research, told TechCrunch. “At present, even the best models can generate hallucination-free text only about 35% of the time.”
To make their benchmark more difficult, and to better reflect the kinds of questions people actually ask models, the researchers chose topics from around the web that do not have a Wikipedia page. Just over half of the questions in their test cannot be answered using Wikipedia (they did include some Wikipedia-sourced questions for good measure), and they cover topics such as culture, geography, astronomy, pop culture, finance, medicine, computer science, and celebrities.
The researchers evaluated over a dozen popular models for their study, many of which had been released within the previous year. In addition to GPT-4o, they tested "open" models including Meta's Llama 3 70B, Mistral's Mixtral 8x22B, and Cohere's Command R+, as well as gated-behind-API models such as Perplexity's Sonar Large (based on Llama), Google's Gemini 1.5 Pro, and Anthropic's Claude 3 Opus.
The findings indicate that models aren't hallucinating any less these days, despite assertions to the contrary from OpenAI, Anthropic, and other major generative AI firms.
Even models that can search the web for information, such as Command R and Perplexity's Sonar models, struggled to answer "non-Wiki" questions in the benchmark. Model size made little difference; smaller models (such as Anthropic's Claude 3 Haiku) hallucinated about as frequently as bigger, purportedly more capable models.
So, what does all of this imply, and where are the improvements that vendors promised?
We wouldn't be surprised if vendors have exaggerated their claims. A more charitable reading, however, is that the benchmarks they're using aren't fit for this purpose. Many, if not most, AI evaluations are short-lived and stripped of important context, which leaves them vulnerable to Goodhart's law.
Regardless, Zhao believes that the issue of hallucinations will "persist for a long time."
"Empirical results in our paper indicate that, despite the promise of certain methods to reduce or eliminate hallucinations, the actual improvement achievable with these methods is limited," she said. "Additionally, our analysis reveals that even the knowledge found on the internet can often be conflicting, partly because the training data, authored by humans, can also contain hallucinations."
An interim approach could be to simply train models to decline to answer more often, the technological equivalent of asking an expert to shut up.
In the researchers' tests, Claude 3 Haiku only responded to about 72% of the questions, opting to skip the rest. When the abstentions were taken into consideration, Claude 3 Haiku was the most factual model of them all — at least in terms of lying less frequently.
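The trade-off behind that result can be made concrete with a little arithmetic. The sketch below is only an illustration: the article does not describe the study's exact scoring method, and the class name `Tally` and every count in it are hypothetical, chosen so that one made-up model answers 72 of 100 questions while another answers all of them.

```python
from dataclasses import dataclass


@dataclass
class Tally:
    """Hypothetical per-model counts; NOT figures from the study."""
    correct: int    # answered, and the answer checked out
    wrong: int      # answered, but the answer contained a hallucination
    abstained: int  # declined to answer

    @property
    def total(self) -> int:
        return self.correct + self.wrong + self.abstained

    def response_rate(self) -> float:
        """Share of all questions the model attempted."""
        return (self.correct + self.wrong) / self.total

    def hallucination_rate(self) -> float:
        """Share of all questions that drew a hallucinated answer."""
        return self.wrong / self.total

    def accuracy_when_answering(self) -> float:
        """Share of attempted questions answered correctly."""
        answered = self.correct + self.wrong
        return self.correct / answered if answered else 0.0


# Illustrative numbers only (assumptions, not measured results).
cautious = Tally(correct=40, wrong=32, abstained=28)  # attempts 72 of 100
eager = Tally(correct=50, wrong=50, abstained=0)      # attempts everything

for name, tally in (("cautious", cautious), ("eager", eager)):
    print(
        f"{name}: response rate {tally.response_rate():.2f}, "
        f"hallucination rate {tally.hallucination_rate():.2f}, "
        f"accuracy when answering {tally.accuracy_when_answering():.2f}"
    )
```

With these made-up counts, the cautious model hallucinates on a smaller share of all questions, yet it also delivers fewer correct answers overall than the eager one. That is the sense in which abstaining can make a model "lie less frequently" without making it more useful.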
But will people adopt a model that declines to answer many questions? Zhao thinks not, arguing that vendors should devote more time and resources to hallucination-reducing research. She says that while it may be impossible to eliminate hallucinations completely, they can be reduced through human-in-the-loop fact-checking and citation during a model's development.
"Policies and regulations need to be developed to ensure that human experts are always involved in the process to verify and validate the information generated by generative AI models," according to Zhao. "There are still numerous opportunities to make significant impacts in this field, such as developing advanced fact-checking tools for any free text, providing citations for factual content and offering corrections for hallucinated texts."