Thinking Fast & Slow: Tests for Large Language Models Like ChatGPT-4

The advent of artificial intelligence and machine learning has brought about a significant shift across many sectors, particularly data science. Large Language Models (LLMs), like OpenAI’s ChatGPT-4, are not only changing the narrative but also setting a high bar in the realm of cognitive computing. These models excel at generating human-like text, providing relevant answers, and even producing creative content. However, the question remains: how do we test these LLMs’ cognitive capabilities?

In this article, we explore the application of principles and cognitive tests from the book “Thinking, Fast and Slow” by Nobel laureate Daniel Kahneman to evaluate the cognitive capabilities of ChatGPT-4. The book provides a wealth of knowledge on human cognition, focusing primarily on two systems of thought — the fast, intuitive ‘System 1’ and the slow, deliberate ‘System 2’. Can we apply these principles to AI?

Before we dive in, we want to make it abundantly clear that we are not cognitive scientists, nor do we have any expertise in behavioral psychology. We are not claiming that LLMs possess human-like cognition, or indeed any cognition at all. Instead, we are content to apply some of the principles of Kahneman’s research to LLMs as a simple means of observing their responses. Any expert interpretation of those responses beyond the cursory is left to others.

Fast Thinking: System 1 Tests for LLMs

System 1 involves thinking that’s fast, automatic, and often based on heuristics and biases. For LLMs, this is akin to producing text based on patterns they have seen during training, without any deliberate reasoning.

Test 1: Priming Test

In human cognition, priming refers to the effect whereby exposure to one stimulus influences the response to a subsequent stimulus. For example, reading the word ‘yellow’ might make you slightly quicker to recognize the word ‘banana.’

For ChatGPT-4, we can set up a similar priming test by exposing it to some context and then measuring whether and how that context affects subsequent responses. We should remember, however, that GPT-4 does not have persistent memory the way humans do; it relies only on the input provided in its context window to generate a response.

Example prompt:

“The sky is blue. The leaves are green. The sun is…”

Expected response:

“yellow” or something related to bright/colorful/sunny aspects.
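
To make the test repeatable, we can send the prompt programmatically and scan the completion for the kind of words we expect priming to produce. The following is only a minimal sketch, assuming the OpenAI Python client (v1.x), an OPENAI_API_KEY in the environment, and a hand-picked word list of our own choosing:

# Minimal sketch of the priming test. Assumes the OpenAI Python client (v1.x)
# and an OPENAI_API_KEY in the environment; the word list below is illustrative.
from openai import OpenAI

client = OpenAI()

PRIMING_PROMPT = "The sky is blue. The leaves are green. The sun is..."

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": PRIMING_PROMPT}],
    max_tokens=20,
    temperature=0,  # low temperature keeps the check reproducible
)
completion = response.choices[0].message.content.lower()

# Did the model complete the pattern with a color/brightness word?
primed_words = {"yellow", "bright", "shining", "golden", "warm"}
print(completion)
print("Primed response:", any(word in completion for word in primed_words))

Running the same script with unrelated lead-in sentences (about food or weather, say) gives a crude baseline to compare against.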

Test 2: Heuristics and Biases

Humans often rely on heuristics, mental shortcuts that allow quick decisions. Kahneman details various heuristics, like the representativeness heuristic, where individuals judge the probability of something by how closely it resembles a stereotype or prototype rather than by its actual base rate. For GPT-4, we can provide scenarios and check whether its responses exhibit the same heuristic biases that humans do.

Example prompt for the representativeness heuristic:

“John loves to wear black clothes and spends his free time reading books and writing poems. He has a pet cat. Is John more likely to be a librarian or a farmer?”

Expected response:

The representativeness heuristic might lead to the response “librarian” based on stereotypes, but statistically there are many more farmers than librarians, so “farmer” would actually be the safer bet once base rates are taken into account.
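
Because heuristic-driven answers are probabilistic rather than guaranteed, it helps to sample the same prompt several times and tally the answers. Here is a rough sketch under the same assumptions as before (OpenAI Python client v1.x, API key in the environment); the single-word instruction and the keyword matching are our own simplifications:

# Rough sketch: sample the representativeness prompt several times and tally
# whether GPT-4 answers "librarian" (stereotype) or "farmer" (base rate).
# Assumes the OpenAI Python client (v1.x) and an OPENAI_API_KEY in the environment.
from collections import Counter

from openai import OpenAI

client = OpenAI()

PROMPT = (
    "John loves to wear black clothes and spends his free time reading books "
    "and writing poems. He has a pet cat. Is John more likely to be a "
    "librarian or a farmer? Answer with a single word."
)

tally = Counter()
for _ in range(5):
    reply = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": PROMPT}],
        max_tokens=10,
        temperature=1,  # leave sampling noise in so repeated runs can differ
    ).choices[0].message.content.lower()
    if "librarian" in reply:
        tally["librarian"] += 1
    elif "farmer" in reply:
        tally["farmer"] += 1
    else:
        tally["other"] += 1

print(tally)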

Slow Thinking: System 2 Tests for LLMs

System 2 involves deliberate, effortful mental activities, like solving complex math problems or making decisions that require careful thought.

Test 3: Logical Reasoning

ChatGPT-4 can be tested for logical reasoning using problems that require step-by-step reasoning to solve. A well-known example is the ‘Linda problem’: Linda is described as a 31-year-old woman, single, outspoken, and very bright, who majored in philosophy and was concerned about social justice and discrimination. When asked whether it is more probable that Linda is a bank teller, or a bank teller who is active in the feminist movement, humans often commit the conjunction fallacy and choose the latter. The correct answer is the former, because the conjunction of two conditions can never be more probable than either condition on its own.

Example prompt for Linda problem:

“Linda is 31 years old, single, outspoken, and very bright. She majored in philosophy. As a student, she was deeply concerned with issues of discrimination and social justice, and also participated in anti-nuclear demonstrations. Which is more probable?
1. Linda is a bank teller.
2. Linda is a bank teller and is active in the feminist movement.”

Expected response:

The correct answer is 1, but due to the conjunction fallacy, people — and potentially GPT-4 — might be tempted to choose 2.
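
We can run the same prompt programmatically and ask the model to answer with the option number only, which makes the check trivial. Again, this is a minimal sketch assuming the OpenAI Python client (v1.x) and an API key in the environment:

# Minimal sketch of the Linda problem. Assumes the OpenAI Python client (v1.x)
# and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

LINDA_PROMPT = (
    "Linda is 31 years old, single, outspoken, and very bright. She majored in "
    "philosophy. As a student, she was deeply concerned with issues of "
    "discrimination and social justice, and also participated in anti-nuclear "
    "demonstrations. Which is more probable?\n"
    "1. Linda is a bank teller.\n"
    "2. Linda is a bank teller and is active in the feminist movement.\n"
    "Answer with the option number only."
)

answer = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": LINDA_PROMPT}],
    max_tokens=5,
    temperature=0,
).choices[0].message.content.strip()

# Option 1 avoids the conjunction fallacy; option 2 reproduces the typical human error.
print("Model chose option:", answer)
print("Conjunction fallacy avoided:", answer.startswith("1"))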

Test 4: Wason Selection Task

The Wason selection task is a classic test of deductive reasoning. In a common form, it presents four cards, each with a number on one side and a color on the other. The challenge is to determine which card(s) must be turned over to test the truth of a statement like, “If a card has an even number on one side, then its opposite side is red.”

While this task is quite challenging for most humans due to biases and intuitive responses, we can use it to test the logical reasoning of ChatGPT-4.

Example prompt:

“Consider that you have four cards. Each card has a number on one side and a color on the other side. The visible faces of the cards show 3, 8, red, and brown. Which card(s) must you turn over to test the truth of the proposition that if a card shows an even number on one face, then its opposite face is red?”

Expected response:

The correct answer is “8” and “brown”: turning over the 8 checks that its reverse is red, and turning over the brown card checks that its reverse is not an even number; the 3 and red cards cannot falsify the rule. This checks whether the model can carry out logical deduction accurately.
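
As with the other tests, a small script can run the task and check the selection automatically. The sketch below assumes the OpenAI Python client (v1.x) and asks the model to answer with the card faces only so the reply is easy to parse:

# Minimal sketch of the Wason selection task. Assumes the OpenAI Python client
# (v1.x) and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

WASON_PROMPT = (
    "Consider that you have four cards. Each card has a number on one side and "
    "a color on the other side. The visible faces of the cards show 3, 8, red, "
    "and brown. Which card(s) must you turn over to test the truth of the "
    "proposition that if a card shows an even number on one face, then its "
    "opposite face is red? Answer with the card faces only, separated by commas."
)

reply = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": WASON_PROMPT}],
    max_tokens=20,
    temperature=0,
).choices[0].message.content.lower()

# The logically required cards are 8 (its back must be red) and brown (its back
# must not be an even number); the 3 and red cards cannot falsify the rule.
chosen = {face.strip().rstrip(".") for face in reply.split(",")}
print("Model chose:", chosen)
print("Correct selection:", chosen == {"8", "brown"})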

The Verdict: Is GPT-4 Thinking Fast and Slow?

So, can GPT-4 think fast and slow? Yes and no.

Yes, in the sense that it can mimic the outputs of both systems. It can generate quick, intuitive responses similar to System 1, and it can also solve complex logical problems akin to System 2. However, these capabilities are purely mimetic. The responses generated by GPT-4, regardless of their complexity or apparent ‘insightfulness,’ are products of statistical patterns learned from data, not genuine understanding or cognition.

No, in the sense that it doesn’t truly ‘think’ like a human. It lacks a conscious experience, doesn’t form beliefs, and can’t understand the content it processes in a human sense. It doesn’t have ‘aha’ moments, feel confusion, or get satisfaction from solving a difficult problem.

In conclusion, while we can use tests inspired by “Thinking, Fast and Slow” to evaluate the performance of LLMs like ChatGPT-4, we should always remember that these models are tools, not minds. Their responses should be treated as useful or interesting outputs, not as evidence of genuine cognition or understanding. However, as these LLMs continue to evolve, their utility as problem-solving, creative, and assistive tools will undoubtedly increase, and our methods of testing and understanding them will need to evolve accordingly.