
The dangers of LLM hallucinations and how to minimize them 


by Joshua Bailey and Ryan Callihan

ChatGPT, Bard and other Large Language Model-based chatbots represent a significant stride in AI-powered conversational interfaces, having been fine-tuned to “comprehend” and respond to customer queries in a natural way.  

They can detect the underlying intent in customer queries quite well. Or, at least, it can feel like that’s what they’re doing. That’s where the concept of LLM hallucinations starts to emerge. 

There is no doubt that LLMs are powerful for some tasks, like summarization, but for many others, like logical reasoning and mathematics, they fall flat. They tend to get simple equations right (think 5+5 or 4-2), but as soon as the problem gets more complex, wrong answers become increasingly common.  

This behavior is a clear sign that the model doesn’t really “understand” like a human. Instead, it works by predicting the token (word) that “looks” best given the user’s input. When the output is nonsensical or contains false information, this is called an LLM hallucination. It is not a case of the model not working; rather, its generated output doesn’t align with expectation (or reality). 
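To make that idea concrete, here is a deliberately toy sketch, not how any real model is implemented: the “model” below is just a hand-written table of next-token probabilities, and it simply picks whichever continuation looks most likely, with no check against reality.

```python
# Toy illustration of next-token prediction (not a real model):
# the "model" is a hand-written table of made-up probabilities.
toy_next_token_probs = {
    "The capital of Australia is": {
        "Sydney": 0.55,    # plausible-looking, but wrong
        "Canberra": 0.40,  # correct
        "Perth": 0.05,
    }
}

def predict_next_token(prompt: str) -> str:
    """Pick the highest-probability continuation, with no fact-checking."""
    candidates = toy_next_token_probs[prompt]
    return max(candidates, key=candidates.get)

print(predict_next_token("The capital of Australia is"))  # -> "Sydney"
```

The point of the toy is only this: whatever “looks” most probable wins, which is exactly how a confident-sounding hallucination gets produced.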

Minimizing the LLM hallucinations

LLM hallucinations are an important feature (or perhaps it could be a bug?) of Large Language Models. But there are a couple of ways to reduce the rate at which an LLM will hallucinate.  

The first one is to engineer your prompt to encourage the model to use more tokens. As Andrej Karpathy said during his “State of GPT” talk at the Microsoft Build 2023 event, “GPT models need tokens to think.”  

One way of conceptualizing this: if you want to get more reasoning out of the model, you need to prompt it in such a way that it writes more. A specific example is adding a phrase like “Let’s think this through step by step” to the end of your prompt.  
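In practice, the change can be as small as appending that phrase. Here is a minimal sketch, assuming the OpenAI Python SDK (v1-style client); the model name and the arithmetic question are placeholders, not recommendations.

```python
# Sketch of zero-shot chain-of-thought prompting.
# Assumes the OpenAI Python SDK (v1-style); model and question are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

question = "A shop sells pens at 3 for $4. How much do 9 pens cost?"

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "user",
            # Appending the phrase nudges the model to spend more tokens
            # reasoning before it commits to an answer.
            "content": question + "\n\nLet's think this through step by step.",
        }
    ],
)
print(response.choices[0].message.content)
```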

This chain-of-thought style of prompting has been used to good effect on reasoning tasks, as shown in “Large Language Models are Zero-Shot Reasoners” by Kojima et al.  

As the title suggests, “Let’s think this through step by step” is a zero-shot way of doing it. You could potentially chain several of these prompts together to get the model to reason and then summarize, until you arrive at a potential answer. This would use many more tokens and would therefore be much more expensive. 
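One way that chaining could look is sketched below, on the same assumed SDK, with a hypothetical ask() helper: a first call produces the step-by-step reasoning, and a second call distills that reasoning into a final answer.

```python
# Sketch of chaining two prompts: reason first, then summarize.
# Assumes the OpenAI Python SDK (v1-style); model and question are placeholders.
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    """Single-turn helper around the chat completions endpoint."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

question = "A shop sells pens at 3 for $4. How much do 9 pens cost?"

# Step 1: get the model to reason at length (it needs tokens "to think").
reasoning = ask(question + "\n\nLet's think this through step by step.")

# Step 2: ask the model to boil that reasoning down to a single answer.
final_answer = ask(
    "Here is some step-by-step reasoning:\n"
    f"{reasoning}\n\n"
    "Summarize it into a single, final answer."
)
print(final_answer)
```

Every extra call in the chain adds tokens and cost, which is the trade-off mentioned above.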

Reducing creative license to improve results 

Chain-of-thought is not the only way to reduce hallucinations. Another, and in our opinion better, way is to do two things at the same time.  

The first is injecting specific context into the prompt that is relevant to the user’s query before generating a response. For example, when writing a message on LinkedIn to a potential lead, including information about the company, what it offers, and what it should “sound like” improves the output massively, making it more personable and relevant.  

The second is simple: we reduce the randomness of the model to close to zero. This ensures that the tokens chosen by the model are the more probable ones, leaving less room for fantastical responses. Essentially, it restricts the model’s “creative license”.  
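Putting both ideas together might look like the sketch below, still assuming the OpenAI Python SDK; the company details are invented placeholders. Relevant context is pasted into the prompt ahead of the task, and the temperature is set to zero so the model sticks to the most probable tokens.

```python
# Sketch combining the two ideas: context injection plus near-zero randomness.
# Assumes the OpenAI Python SDK (v1-style); company details are invented placeholders.
from openai import OpenAI

client = OpenAI()

# 1. Context relevant to the lead, gathered beforehand (e.g. from a CRM
#    record or the company's website), injected ahead of the task.
company_context = (
    "Company: Acme Analytics\n"
    "Offering: text analytics for customer feedback teams\n"
    "Tone of voice: friendly, plain-spoken, no jargon\n"
)

prompt = (
    "Use only the context below when writing.\n\n"
    f"Context:\n{company_context}\n"
    "Task: write a short, personable LinkedIn message to a potential lead "
    "who runs a customer insight team."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,  # 2. Near-zero randomness: prefer the most probable tokens.
)
print(response.choices[0].message.content)
```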

However, there is no all-encompassing cure or fix for hallucinations in LLM tools like ChatGPT. There are ways to reduce them, but the best way to get accurate data is to explore other tools to use alongside them.  

At Relative Insight, we do just that, using LLMs for features where it makes sense. But when it comes to reproducible metrics and facts, we always use our own software to ensure quality and credibility.  


