Just today, the LLM-based auto-review that my company enabled for all PRs edited my PR description to confidently assert that I had added a new RPC. I had not. I deleted code and nothing else. Nothing was added. The RPC it claimed I added did not exist.
LLMs are nondeterministic, so it’s impossible to make something 100% reproducible. Even when it gets something wrong, it might get it wrong in a different way the next time. If a failure is well publicized, they’ll patch that very specific example, but the foundational issue is still there (like counting the R’s in strawberry).
I still regularly run into the issue where it just makes up API endpoints or CLI commands, or adds flags that simply don’t exist.
I also regularly ask it things and it gives me a bad answer, so I push back, and it says something to the effect of “you’re right, I didn’t consider that, let me look at that more”… then tells me the exact opposite of the previous response.
Or it says “thing X has never happened”, and I ask what about <insert example>, and it goes to look it up and says, “oh, thing X actually did happen.”
I run into this daily. Multiple times per day. How can I trust a system like this? Are people just blindly accepting what the LLM says as truth? Is that why people think it’s good?
I just tried this with ChatGPT only, twice: the web interface in a Firefox private window and in a Chrome incognito window. I asked both the identical question: "How many names of the days of the week contain the letter D?"
In Firefox I got 6. In Chrome I got 7. LLMs are not even self-consistent.
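For reference, the correct answer is easy to check mechanically. Here is a minimal Python sketch, assuming English day names and a case-insensitive match; the function name days_containing is just something I made up for illustration:

```python
# English day names; the question is assumed to be about these seven.
DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]

def days_containing(letter):
    """Return the day names that contain `letter`, ignoring case."""
    needle = letter.lower()
    return [day for day in DAYS if needle in day.lower()]

matches = days_containing("d")
print(len(matches), matches)
```

Every name ends in "day", so all 7 match. That means the 7 from Chrome was right and the 6 from Firefox was wrong.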
Bad example but since it literally just happened a few hours ago:
Teams Copilot meeting assistant auto-renamed the meeting title/summary, which is now prominently placed at the top, to “Month end close wrap up discussion” because someone posted in chat “sorry can’t make the meeting, we’re wrapping up month end close”.
Really confused the next guy who joined the meeting and derailed things for a minute or two before we could get back on topic.
I often hear this. Can you give me a question where a major LLM hallucinates or provides poor guidance? Reproducible would be great.
Just a question to stump it.