OpenAI released GPT-2, one of the earliest widely known Large Language Models, in 2019, followed by GPT-3 in 2020 and GPT-4 in 2023. Anthropic released Claude in 2023, and Google released Gemini the same year. The models kept getting better over time, but the hallucination issue remained persistent. Given that LLMs are fundamentally next-token prediction models, striking a balance between creativity and hallucination will continue to be a hard problem.
Given how LLM output is consumed today, hallucination is not a big issue. People usually interact with LLMs through chat interfaces. LLMs support tool calling, but it is typically limited to a few safe tools like web browsing and file access, and a human usually verifies the output before it is used. Over time, LLMs will creep into every aspect of our lives: the way we interact with phones, cars, computers, refrigerators, and more will change. When that happens, verifying an LLM's output before it is used to make important decisions will be critical.
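To make that pattern concrete, here is a minimal human-in-the-loop sketch in Python, not tied to any particular LLM framework: calls to a small allowlist of safe tools run automatically, while everything else waits for explicit human approval. The tool names and the `run_tool` callback are illustrative assumptions, not a real API.

```python
# Minimal sketch of a human-in-the-loop gate for model-proposed tool calls.
# Assumes a hypothetical `run_tool(name, args)` callback supplied by the caller.

SAFE_TOOLS = {"web_search", "read_file"}  # low-risk tools that run without review

def execute_with_approval(tool_name: str, arguments: dict, run_tool) -> str:
    """Run a tool call proposed by the model, pausing for human review when needed."""
    if tool_name not in SAFE_TOOLS:
        print(f"Model wants to call {tool_name} with {arguments}")
        if input("Approve? [y/N] ").strip().lower() != "y":
            return "Action rejected by human reviewer."
    return run_tool(tool_name, arguments)
```

The design choice is the same one the chat interfaces make implicitly: the cheap, reversible actions flow through, and a human stands between the model and anything consequential.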
Moreover, given how fast LLMs can generate output, manually verifying that output becomes the time-consuming step. As Andrej Karpathy put it in his recent Software Is Changing (Again) talk: "Verification is the Bottleneck".
If you've used code-generation tools like Cursor, Windsurf, or Lovable, you've experienced this firsthand. The model instantly generates 100+ lines of code, but sometimes it inadvertently changes or breaks existing code. This is why you spend most of your time verifying the changes and manually approving or rejecting them.
Now apply this to the physical world, where robots and humanoids will be making decisions and taking actions. There, the verification step becomes even more critical.
Beyond hallucination, there are instances where LLMs simply cannot act because they are dealing with something they have never seen before. The real world constantly evolves: new rules and regulations, new objects. Meanwhile, the data the model was trained on grows stale.
As we stand at the cusp of an AI-integrated future, one thing becomes increasingly clear: AI will continue to need a human nudge.
The verification bottleneck that Andrej Karpathy identified isn't a temporary growing pain; it's a permanent feature of how we'll interact with AI systems. Whether it's reviewing code generated in seconds, approving a robot's next action, or helping a stuck delivery robot maneuver through a roadblock, the human element provides the contextual wisdom, ethical judgment, and real-world grounding that AI lacks.
This isn't a limitation to overcome; it's a partnership to embrace.