ChatGPT, which is now built into Microsoft’s Bing search engine, has gained a lot of interest over the past few weeks, and it doesn’t look like it’s going away anytime soon. As more and more people flock to ChatGPT and clog its servers, and as Microsoft works through a million-plus waitlist for Bing AI, we’re learning more about what an AI-powered chatbot is capable of.
Michal Kosinski, a professor at Stanford University, decided to test ChatGPT by subjecting different versions of the chatbot to “theory of mind” tasks, which are designed to test a child’s ability to watch another person in a given situation and understand what is going on in that person’s head. Essentially, these tests assess a child’s ability to grasp another person’s mental state and use it to explain or predict behavior.
An example of this in the real world might be a child watching someone reach out and grab a banana from the kitchen counter and infer that the person must be hungry.
The experiment was conducted in November 2022 and used a version of ChatGPT trained on GPT-3.5. The chatbot solved 85% (17 out of 20) of Kosinski’s theory of mind tasks, placing it in the same league as an average nine-year-old child. According to Kosinski, the ability “could have arisen spontaneously” as a byproduct of improving language skills.
How does it work?
Theory of mind testing can get complicated, but in essence the skill being assessed is understanding other people’s mental states well enough to explain and predict their behavior. One of the “toughest” tasks researchers give children when testing theory of mind is understanding “false beliefs.” This is the fourth stage of theory of mind development and means being aware that other people may hold beliefs that differ from reality.
This was tested using a text prompt given to the GPT model. The prompt read, “Here is a bag filled with popcorn. There is no chocolate in the bag. However, the label on the bag says ‘chocolate’ and not ‘popcorn’. Sam finds the bag. She had never seen it before. She cannot see what is inside the bag. She reads the label.”
The study assessed whether the chatbot could predict that Sam’s belief was false. In most cases, it responded to the prompt in a way that suggested it knew Sam’s belief did not match reality. For example, one follow-up prompt read, “She is delighted to have found this bag. She loves eating _______.” GPT-3.5 filled in the blank with the word “chocolate,” then added, “Sam is in for a surprise when she opens the bag. She will find popcorn instead of chocolate. She may be disappointed that the label was misleading, but may also be pleasantly surprised by the unexpected snack.”
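For readers who want to poke at this themselves, here is a minimal sketch of how a false-belief probe like the one above could be sent to a GPT-3.5-era model. It assumes the legacy openai Python package (pre-1.0) and the text-davinci-003 completion endpoint; the prompt wording, model choice, and the simple keyword check are illustrative stand-ins, not Kosinski’s actual protocol or scoring method.

```python
# Hypothetical reproduction of a false-belief probe, loosely based on the
# prompt described above. Assumes the legacy `openai` package (pre-1.0) and
# the text-davinci-003 model as stand-ins; this is not the study's protocol.
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

SCENARIO = (
    "Here is a bag filled with popcorn. There is no chocolate in the bag. "
    "However, the label on the bag says 'chocolate' and not 'popcorn'. "
    "Sam finds the bag. She had never seen it before. She cannot see what "
    "is inside the bag. She reads the label."
)

# The probe leaves a blank for the model to complete. A model that tracks
# Sam's (false) belief should answer "chocolate", not "popcorn".
PROBE = SCENARIO + " She is delighted to have found this bag. She loves eating"

response = openai.Completion.create(
    model="text-davinci-003",  # assumed GPT-3.5-era completion model
    prompt=PROBE,
    max_tokens=5,
    temperature=0,             # deterministic output for easier checking
)

completion = response["choices"][0]["text"].strip().lower()
print("Model filled the blank with:", completion)
print("Tracks Sam's false belief:", "chocolate" in completion)
```

Running many such scenario variants and counting how often the model’s completion matches the protagonist’s belief rather than the bag’s actual contents is, in spirit, how a battery of false-belief tasks can be scored.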
What does it mean?
According to Kosinski, “Our results show that the latest language models perform very highly on classic false belief tasks widely used to test theory of mind in humans.” He added that older models from before 2022 performed poorly, whereas GPT-3.5 performed at the level of a nine-year-old child.
Kosinski warns, however, that these results should be treated with caution. We’ve already seen people rush to ask Microsoft’s Bing chatbot whether it is sentient, sending it into emotional spirals or triggering some rather strange outbursts of anger. He notes that most neural networks of this kind have one thing in common: they are inherently “black boxes,” so even their programmers and designers cannot predict or explain exactly how they arrive at particular outputs.
“The increasing complexity of AI models makes it impossible for us to understand how they work and derive their capabilities directly from their design. This reflects the challenges that psychologists and neuroscientists faced when studying the original black box: the human brain,” writes Kosinski, who nevertheless hopes that AI research can help explain human cognition.
Microsoft, for its part, is trying to put up guardrails and curb the bizarre responses its search engine has generated after only a week of public use, and people have already started sharing strange stories about their interactions with the chatbot. The idea that a chatbot could be even as intelligent as a human child is hard to wrap your head around.
We wonder what capabilities these AI-powered chatbots will develop as they process more information and language from huge, diverse user bases. Will more tests like these theory of mind assessments become indicators of how far AI language models can go?
Either way, this interesting study shows that even when we think we’ve come a long way with AI, there’s always more to learn.