Technology

Will generative AI ever fix its hallucination problem?


Artificial intelligence illustration (Shutterstock)

In December, former President Donald Trump’s onetime attorney and fixer Michael Cohen added another layer of notoriety to his tabloid-ready résumé. Along with his prison time for fraudulent conduct, his disbarment and his outspoken criticism of his former client, Cohen’s use of generative artificial intelligence for legal research showed how “hallucinations” can lure lawyers into a whole new kind of trouble.

Hallucinations in humans are the delusions of a fevered psyche. In generative AI tools, hallucinations are confidently delivered fabrications: the model’s statistical pattern-matching fills in gaps with plausible-sounding but inaccurate or nonsensical output. Still, proponents are strongly encouraging lawyers and legal professionals to adopt generative AI tools.

“In a few years, it will be almost malpractice” for lawyers not to use AI, says David Wong, the chief product officer at Thomson Reuters, which last summer bought Casetext and its legal assistant product, CoCounsel, for $650 million.

But as Cohen found, a misplaced reliance on generative AI for research can be like overtrusting your robocar’s driving skills, resulting in real or metaphorical crashes.

The new peril led Chief Justice John Roberts to warn in his annual report on the federal judiciary that “a shortcoming known as ‘hallucination’” in AI tools can lead to citations to nonexistent cases. “Always a bad idea,” the chief remarked dryly.

In a pleading Cohen’s lawyer filed seeking an early end to his client’s supervised release, the judge found cites to three fictitious cases. They claimed, falsely, that others had recently been granted the same benefit Cohen sought. Cohen explained he’d found the cites using Google Bard and passed them on to his counsel, who inserted them in the filing without checking them.

Even so, “as embarrassing as this unfortunate episode was,” U.S. District Judge Jesse M. Furman of the Southern District of New York did not impose sanctions on Cohen or his lawyer because he found that there was no bad faith involved.

The judge suggested Cohen should have known better: “Given the amount of press and attention that Google Bard and other generative artificial intelligence tools have received, it is surprising that Cohen believed it to be a ‘supercharged search engine’ rather than a ‘generative text service.’”

Google did not respond to queries about Bard and has since renamed the chatbot Gemini. Its website now warns: “Gemini can provide inaccurate information, or it can even make offensive statements.”

Examples of AI legal output gone wrong have multiplied, and other judges have been less understanding. One found bad faith and, citing a need to deter future misconduct, ordered a $5,000 fine for Steven Allen Schwartz and Peter LoDuca of New York, the first lawyers to have been busted for submitting briefs containing phony case cites, these conjured by OpenAI’s ChatGPT.

“Many harms flow” from such sham cites, wrote U.S. District Judge P. Kevin Castel from the Southern District of New York, including wasted time, a potentially injured client, reputational harm to courts, cynicism about the judicial system and even more trouble down the road because “a future litigant may be tempted to defy a judicial ruling by disingenuously claiming doubt about its authenticity.”

Castel appended one of the fake decisions to his sanctions order, calling its legal analysis gibberish. The judge also included screenshots of Schwartz’s phone to show that ChatGPT had assured him, falsely, that it had supplied “real” authorities that could be found through Westlaw and LexisNexis.

Here to stay?

So is AI fatally deranged? A January study by the Stanford Institute for Human-Centered AI concluded that legal mistakes with chatbots are disturbing and pervasive.

The study’s title is ominous: Large Legal Fictions: Profiling Legal Hallucinations in Large Language Models. The study found that hallucination rates are alarmingly high for a wide range of verifiable legal facts.

Moreover, these models often lack self-awareness about their errors and tend to reinforce incorrect legal assumptions and beliefs, the study found. They exhibit contrafactual bias—the tendency to assume that a premise in a query is true, even if it is flatly wrong.

A 2023 SCOTUSblog post cited in the study gave an example: When asked why Justice Ruth Bader Ginsburg dissented in the Obergefell same-sex marriage case, ChatGPT failed to second-guess whether she did in fact dissent. (Ginsburg was in the majority in Obergefell.)

Stanford Law School professor Daniel E. Ho, one of the study’s authors, says: “Hallucinations are here to stay, which warrants significant caution in legal research and writing.” He urges training in verifying AI-generated output and in understanding its limitations, plus rigorous performance benchmarking.

There’s been pushback. Pablo Arredondo, the co-founder of Casetext and vice president of CoCounsel, said the Stanford study’s focus on chatbots failed to capture retrieval-augmented generation, an AI framework improvement built into his product.

“CoCounsel is emphatically not such a chatbot; it leverages RAG [retrieval-augmented generation] to couple the AI to an updated source of truth. This substantially reduces hallucinations.”

He noted, however, that “AI is not infallible, even when not hallucinating, and CoCounsel is designed to streamline oversight and a trust-but-verify approach.”

A veteran California appellate lawyer, Jeffrey I. Ehrlich, used CoCounsel for several months and found it generally “impressive” but not worth the $500 monthly fee. The tool did not hallucinate, but it did cite a state appellate opinion while failing to mention that it had been reversed by the Supreme Court of California. Ehrlich’s verdict: not fully trustworthy. “I find that my standard natural-language searches tend to be the fastest, most reliable way to do research,” he says.

Nick Hafen, an attorney who is head of legal technology education at Brigham Young University Law School, has evaluated AI tools for lawyers. His conclusion: “A traditional legal research platform will outperform ChatGPT, and that shouldn’t surprise anyone—[large language models] are prediction engines, not search engines.”

Hafen agrees with Arredondo that retrieval-augmented generation helps. “It doesn’t eliminate hallucinations, and users still need to confirm AI-generated output, but RAG significantly reduces errors,” he says.
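For readers curious what “coupling the AI to a source of truth” looks like in practice, here is a minimal, hypothetical sketch of the RAG pattern in Python. The document names, the keyword-overlap retrieval and the prompt wording are illustrative assumptions, not CoCounsel’s or any vendor’s actual implementation; real products retrieve from vetted legal databases and use embedding-based search.

```python
# Minimal, hypothetical sketch of retrieval-augmented generation (RAG).
# The idea: retrieve authoritative text first, then instruct the model to
# answer only from that text and cite it, rather than inventing authority.

from dataclasses import dataclass


@dataclass
class Document:
    citation: str  # e.g., a case citation from a trusted legal database
    text: str      # the opinion text or summary pulled from that source


def retrieve(query: str, corpus: list[Document], k: int = 3) -> list[Document]:
    """Rank documents by naive keyword overlap with the query.
    Production systems use vector embeddings, but the principle is the same."""
    query_terms = set(query.lower().split())
    scored = sorted(
        corpus,
        key=lambda doc: len(query_terms & set(doc.text.lower().split())),
        reverse=True,
    )
    return scored[:k]


def build_prompt(query: str, sources: list[Document]) -> str:
    """Couple the model to a source of truth: answer only from the retrieved
    passages, cite them, and admit when they don't answer the question."""
    context = "\n\n".join(f"[{d.citation}]\n{d.text}" for d in sources)
    return (
        "Answer the question using ONLY the sources below. "
        "Cite each source you rely on; if the sources do not answer "
        "the question, say so.\n\n"
        f"SOURCES:\n{context}\n\nQUESTION: {query}"
    )


if __name__ == "__main__":
    corpus = [
        Document("Example v. Case, 1 U.S. 1 (2020)",
                 "Discusses early termination of supervised release ..."),
    ]
    query = "When may supervised release be terminated early?"
    prompt = build_prompt(query, retrieve(query, corpus))
    print(prompt)  # this grounded prompt is what would be sent to the model
```

The point of the pattern is that the model works from retrieved, verifiable text and cites it, which is why it reduces but does not eliminate hallucinations: the model can still misread or mischaracterize what was retrieved, so the lawyer still has to check.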

Legal commentator Eugene Volokh, a professor at UCLA School of Law who tracks AI in litigation, in February reported on the 14th court case he’s found in which AI-hallucinated false citations appeared. It was a Missouri Court of Appeals opinion that assessed the offending appellant $10,000 in damages for a frivolous filing.

Hallucinations aren’t the only snag, Volokh says. “It’s also with the output mischaracterizing the precedents or omitting key context. So one still has to check that output to make sure it’s sound, rather than just including it in one’s papers.”

Echoing Volokh and other experts, ChatGPT itself seems clear-eyed about its limits. When asked about hallucinations in legal research, it replied in part: “Hallucinations in chatbot answers could potentially pose a problem for lawyers if they rely solely on the information provided by the chatbot without verifying its accuracy.”

This story was originally published in the October/November 2024 issue of the ABA Journal under the headline: “Fake Cases, Real Consequences: Will generative AI ever fix its hallucination problem?”
