Is the AI Acid Trip Over?
The prodigious and insightful Ray Schroeder recently wrote a piece for Inside Higher Ed about the prospect of LLMs hallucinating less and less. “Hallucination” is the term du jour for when generative AI makes things up. It’s a rather unfortunate term, as I’ve argued in Is ChatGPT on Acid?, but making stuff up is a real issue. Just like humans, gen AI will provide an answer even when it doesn’t know for certain whether it is true or false. Ever had a conversation with a human? Yeah, we all know humans do this kind of thing all the time too.
Ray makes this analogy:
This is much like the test-taking strategy in certain standardized tests, for which subjects are advised to guess rather than not answering a question for which they don’t have a reliable answer. Hence, in order to achieve the best outcome, models invent answers that could be plausible, but for which they don’t have solid evidence. That, of course, undermines the validity of the response and the credibility of the tool.
But Ray cites recent studies and advances in AI models, such as the “deep research” capabilities that allow LLMs to go much further into fact-finding than they ever could before (GPT-5 and Gemini 2.5 Pro, for example).
Ray suggests that the LSD-laden days of the early LLM period might be coming to an end. Better models, better data, far more access to real-time information (remember when these tools weren’t connected to the Internet?), and tuning that curbs the “answer at all costs” behavior that has plagued LLMs from day one could together make hallucination a largely solved problem. Ray is hopeful that this will allow these tools to better help scholars and students going forward, letting them live up to the Assistants for the Rest of Us that I outlined earlier this year.
But after reading Ray’s piece, both Zach and I were left with some lingering questions, and perhaps doubts. Neither of us is a computer scientist or an AI expert, but we are avid users and observers.
Zach pushed back, saying, “I’m not so sure. Are hallucinations baked into LLMs? If the rate goes all the way down to 0.01%, is that good enough for all tasks?”
Clearly, for some tasks even a 0.01% failure rate could prove catastrophic: a health diagnosis, an engineering plan, air travel, and so on. But I counter that a 0.01% hallucination rate would be totally fine for academic research, such as citation gathering and data interpretation. After all, students and researchers still need to take ownership of the prompts, the output, and the application of gen AI. Double-check your work and all.
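To make the numbers concrete, here is a quick back-of-envelope sketch of what a 0.01% error rate means at different scales. The rate and the volumes are illustrative assumptions, not measured figures:

```python
# Back-of-envelope math: how many bad answers a given error rate produces at scale.
# The error rate and volumes below are illustrative assumptions, not measured figures.

def expected_errors(error_rate: float, num_responses: int) -> float:
    """Expected number of erroneous answers for a given error rate and response volume."""
    return error_rate * num_responses

hallucination_rate = 0.0001  # 0.01%, i.e. 1 in 10,000 responses

# A researcher double-checking 200 gathered citations:
print(expected_errors(hallucination_rate, 200))        # 0.02 -> almost certainly zero bad citations
# A safety-critical system answering 1,000,000 queries:
print(expected_errors(hallucination_rate, 1_000_000))  # 100.0 -> one hundred failures
```

The same rate that is negligible for a reading list becomes a hundred failures at scale, which is why the acceptable threshold depends on the task.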
I posed the question: what is the error rate for humans? Is it higher or lower than gen AI’s? If I ask a research assistant to get me five books about hydropower from the library, I am just about 100% certain that they will return with five books, all on hydropower. They may not be good ones, or the ones I would have chosen, but coming back empty-handed is hard to imagine.
The question is: what is an acceptable error rate for different gen AI tasks and queries, and does that change based on discipline? Engineering vs sociology, for example.
Traditional computers, built on 1s and 0s, trained us to expect deterministic output: hit the N key and an N appears on the screen. But LLMs aren’t like that. Solving the hallucination problem will partly come from better models, but we are likely also training ourselves to have different expectations. My old computer couldn’t type the N key on its own, but my new one can. With that come new advantages and disadvantages.
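A toy sketch of that difference (a simplified illustration, not how any particular model actually works): a deterministic keyboard mapping always returns the same character, while a language model samples its next token from a probability distribution, so the same input can yield different outputs. The vocabulary and probabilities below are made up purely for illustration:

```python
import random

# Deterministic, old-computer behavior: the same key always produces the same character.
def press_key(key: str) -> str:
    return key.upper()

# Probabilistic, LLM-like behavior (toy example): the next token is sampled from a
# probability distribution, so the same prompt can produce different continuations.
def next_token(prompt: str) -> str:
    vocabulary = ["dam", "turbine", "reservoir", "unicorn"]
    probabilities = [0.5, 0.3, 0.15, 0.05]  # the small tail is where "hallucinations" live
    return random.choices(vocabulary, weights=probabilities, k=1)[0]

print(press_key("n"))                        # always "N"
print(next_token("Hydropower relies on a"))  # usually "dam", occasionally "unicorn"
```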
My last thought on Ray’s piece is that if hallucination rates keep dropping toward zero, this has the potential to remove one of the objections to gen AI that critics in academia and elsewhere have raised. The implications are significant. Gen AI would become even more indispensable for professional and academic work. More than that, anyone with a smartphone and an Internet connection (last I checked, about 5 billion people) will have access to tools that, for most of humanity, only a small percentage of us have ever had.