Do you want your accountant to be actively hallucinating?

AI-powered chatbots should really be called "bullshit machines." And maybe we shouldn't trust them with important questions.

You have, I am sure, heard of the Infinite Monkey Theorem, which states that a monkey hitting keys at random on a typewriter keyboard for an infinite amount of time will almost surely type any given text, including the complete works of William Shakespeare.

I know that because I opened Microsoft’s Copilot chat window and asked “who came up with the idea of monkeys and typewriters?” It responded by basically rewriting the Wikipedia article on the topic about as effectively as your average college freshman might have. I would have deducted one full grade for leaving out the role of Jorge Luis Borges and his 1939 story “The Total Library,” but I realize that Borges isn’t an author most students are likely to encounter until their junior or senior year.

Anyway, I like to think of Copilot and other AI-powered chatbots as “bullshit machines,” designed to confidently answer any question in an earnest and convincing fashion, even to the point of making things up more or less completely.

In my personal experience, I’ve found that you get pretty high-quality results if you ask a question that’s relatively uncomplicated and not controversial. It’s even easier if there’s a Wikipedia article on the subject.

Things get dicier when the questions get more complicated. For example, I recently asked Copilot for help with some complex scenarios involving state income taxes. I got super-confident answers telling me exactly how my scenario qualified for specific deductions, complete with hyperlinked footnotes showing the alleged sources of the advice.

I say alleged because when I followed the links to those articles (all from sketchy sources), I found that they didn’t say anything remotely like what I was being advised. I used the Internet Archive’s Wayback Machine to confirm that they had never said those things, and when I searched for the specific advice I got … nothing.

When I asked ChatGPT the same questions, I got much more tentative answers, with no footnotes. This was disappointing to the part of my brain that was really hoping for an authoritative answer, but slightly more reassuring to the rest of my body, which doesn’t want to go to prison for tax evasion.

I thought of this experience when I read an article from my colleague David Gewirtz over at ZDNET. He just did a fun experiment asking ChatGPT to create capsule descriptions of all 50 states, with an illustration and three short facts about each one. Then he asked some other AIs to fact-check GPT’s work.

I fact-checked ChatGPT with Bard, Claude, and Copilot - and this AI was the most confidently incorrect

So here's the thing. If GPT-4, the OpenAI large language model (LLM) used by ChatGPT Plus, generated the fact statements, I wasn't entirely convinced it should be checking them. That's like asking high school students to write a history paper without using any references, and then self-correct their work. They're already starting with suspect information -- and then you're letting them correct themselves? No, that doesn't sound right to me.

But what if we fed those facts to other LLMs inside of other AIs? Both Google's Bard and Anthropic's Claude have their own LLMs. Bing uses GPT-4, but I figured I'd test its responses just to be completionist.

The results are worth reading, and this description of Bard’s “confidently incorrect” work seems on the money:

Bard's fact-checking seems impressive, but it often misses the point and gets things just as wrong as any other AI. … [M]ost of Bard's entries were either wrong or wrong-headed. And yet, Google puts its AI answers in front of most search results. Does that concern you? It sure worries me.

Why, yes, it does concern me.

Speaking of confidently incorrect answers, did you see this story about the attorney for Michael Cohen, the convicted felon who once worked for Donald J. Trump?

Michael Cohen lawyer David Schwartz cited fake cases: Judge

Cohen attorney David M. Schwartz submitted a letter motion on Nov. 29 arguing that there were three examples in 2022 of the U.S. Court of Appeals for the Second Circuit affirming the “early termination of supervised release.”

There was just one problem (or three problems): “As far as the Court can tell, none of these cases exist,” Judge Furman wrote Tuesday.

You definitely don’t want your accountant or your lawyer actively hallucinating like they just stepped out of Fear and Loathing in Las Vegas, which is what you’re likely to get with these AI tools in their current form. How long until we hear about an overworked doctor who killed a patient because they relied on something an AI told them?