Generative AI bots have us playing different versions of the Turing test game.
Abstract
In 2026, we interact with generative AI constantly. There are open-ended AI chatbots, coding assistants, documentation chatbots, social media bots, and more. Each interaction requires rapid decisions about when to trust AI outputs and when we're being fooled. This blog post explores one possible conceptual model for thinking about these interactions and organizing disparate data points and experiences into a framework that helps us think more clearly. We take the parameters of the original Turing test, which was designed to test whether a machine could fool a human evaluator into thinking it was human, and vary those parameters to map them onto the different real-world AI interactions we encounter today. The hypothesis is that by viewing different AI scenarios through the lens of shifting game dynamics, we can better categorize and reason about what is going on. Additionally, this blog post focuses on being fooled by AI rather than other framings of failure, such as accuracy, lying, bullshit, or hallucinations, because "being fooled" covers a broader range of ways AI systems can fail humans and fits better into a game-based framing. Finally, we discuss limitations of this conceptual model.
Generative AI bots are everywhere in 2026
In 2026, generative AI is embedded throughout our daily digital lives. Customer service chatbots answer questions about our orders. Coding assistants suggest lines of code as we type. Content generation tools help draft emails, create images, and even write essays. Social media is filled with bot-generated comments and posts. Each day, we navigate dozens of interactions with AI systems, some obvious, many subtle. Sometimes they reduce small bits of toil so we can focus on more complex tasks, sometimes they act as multipliers, and in some cases they replace work entirely.
This pervasiveness creates a peculiar kind of cognitive load. We constantly make rapid decisions about when to trust AI outputs, how to use these tools effectively, and when AI might be fooling us or giving us wrong information. A chatbot gives you an answer: do you trust it? When do you trust it? What should you be on the lookout for? A coding assistant suggests a dependency: is it suggesting something current and secure? A social media comment is generic and short: is it a bot? These aren't abstract questions. We already face them many times a day, and we will face them more and more often over time.
We will have to continuously ask ourselves: "Am I being fooled by a machine right now?"
Using mental models to organize thinking
How do we think about all these different AI interactions systematically? When confronted with many data points or pieces of information about something new, our minds naturally reach for metaphors and mental models. The right conceptual model doesn't just describe; it helps us navigate, predict, and make better decisions. It doesn't have to capture every nuance to be helpful; sometimes a model is useful simply because it organizes our observations about the world.
This blog post explores one possible conceptual model for thinking about generative AI interactions that I haven't seen described elsewhere yet. It is written up here as an exercise to see how well it holds up to scrutiny.
Using the Turing test as a model for how to think about being fooled by AI
The Turing test was originally introduced in Alan Turing's 1950 paper "Computing Machinery and Intelligence." Rather than attempting to determine whether a machine can "think," the test instead tries to answer, via an imitation game, the more practical and more easily defined question of whether a human evaluator can be fooled into thinking a machine is human.
It is at this point, when humans are fooled, that machine pseudo-intelligence starts to become useful (and also troublesome) in the real world.
Specifically, Turing said,
"I believe that in about fifty years' time it will be possible to program computers, with a storage capacity of about 10^9, to make them play the imitation game so well that an average interrogator will not have more than 70 percent chance of making the right identification after five minutes of questioning. The original question, 'Can machines think ? ' I believe to be too meaningless to deserve discussion. Nevertheless I believe that at the end of the century the use of word and general educated opinion will have altered so much that one will be able to speak of machines thinking without expecting to be contradicted."
By systematically varying the parameters of the original game, including the success criteria, we can map the different types of AI interactions we encounter today onto different variations of that original framework. This creates a mental model to organize and categorize what can otherwise be a confusing collection of isolated events.
The original Turing test parameters
In Turing's original proposal, a human evaluator engages in text-based conversations with two hidden participants: one human, one machine. The evaluator asks questions and tries to determine which is which based solely on their responses. If the machine can fool the evaluator a significant percentage of the time, it passes the test.
The original test had specific parameters:
- Time: 5-minute conversation
- Interaction type: back-and-forth dialogue where the evaluator asks both participants questions
- Medium: Text-only communication
- Players: One human, one machine, one evaluator
- Topic: General conversation (though in some forms limited to whether either participant is human or machine)
- Context: No external context; machine relies solely on its programming
- Bot's goal: Fool the human into thinking it's human
- Evaluator's goal: Determine which of the two participants is the machine
- Awareness: Evaluator knows one is a machine, just not which one
- Success metric: Percentage of time the evaluator is fooled
Turing's test was modeled on a parlor game called the "imitation game," where an evaluator had to guess which of two hidden participants was a man and which was a woman based solely on their written responses to questions.
Varying the parameters to map to different real-world generative AI scenarios
If you change those parameters, you're still playing an imitation game, just a different variation. The bullets below show different possible parameter values.
- Time:
- Few seconds
- 5 minutes
- Ongoing conversations over weeks or months
- We can also consider conversation length, e.g., a single sentence versus dozens of paragraphs
- Interaction type:
- One-shot output
- Dynamic back-and-forth
- Medium:
- Text-only
- Voice
- Multi-modal
- Players:
- One human and one bot
- Human + bot collaboration
- Other combinations
- Topic:
- General
- Constrained to a specific domain
- Context:
- Model only
- Model + RAG
- Model + internet search + conversation history + MCP, etc.
- Bot Goal:
- Fooling the evaluator
- Being useful
- Completing a specific task
- Evaluator Goal:
- Catching the bot
- Getting help
- Evaluating an artifact
- Completing a task
- Awareness:
- Evaluator knows it is conversing with a bot
- Evaluator does not know it's a bot
- Success metric:
- Percentage of time the evaluator is fooled
- Task completion rate
- User satisfaction with answers from bot
- etc.
By mapping real-world AI interactions onto this parameter space, we can frame different AI experiences as different variations of a Turing-test-like game and better grasp what's actually being evaluated and when we're likely to be fooled.
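To make the parameter space a bit more concrete, here is a minimal sketch in Python of what one of these game descriptions could look like as a data structure. Everything here is illustrative (the class and field names are my own, not part of any library); it simply shows the original Turing test and one modern scenario expressed as two settings of the same parameters.

```python
from dataclasses import dataclass

@dataclass
class ImitationGame:
    """One variation of a Turing-test-like game, described by its parameters."""
    time: str
    interaction_type: str
    medium: str
    players: str
    topic: str
    context: str
    bot_goal: str
    evaluator_goal: str
    evaluator_knows_bot: bool
    success_metric: str

# Turing's original 1950 formulation of the game
original_turing_test = ImitationGame(
    time="5-minute conversation",
    interaction_type="back-and-forth dialogue",
    medium="text only",
    players="one human, one machine, one evaluator",
    topic="general conversation",
    context="model only",
    bot_goal="fool the evaluator into thinking it is human",
    evaluator_goal="identify which participant is the machine",
    evaluator_knows_bot=True,  # knows one participant is a machine, just not which
    success_metric="percentage of time the evaluator is fooled",
)

# A documentation chatbot (Game 3 below) as a different setting of the same parameters
documentation_chatbot = ImitationGame(
    time="minutes",
    interaction_type="dynamic back-and-forth Q&A",
    medium="text chat",
    players="one bot, one human user",
    topic="constrained to company documentation",
    context="model + RAG",
    bot_goal="solve the user's problem",
    evaluator_goal="get help, not catch the bot",
    evaluator_knows_bot=True,
    success_metric="problem solved / user satisfaction / human time saved",
)
```

The point of writing it this way is not the code itself but the framing: every scenario in the rest of this post is just a different assignment of values to the same set of fields.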
Deeper dive into "being fooled" as a framing for generative AI failures
Alternative framings for understanding human-AI failures
This blog post focuses on game dynamics and "being fooled" as the framing for understanding generative AI failures. People also frame generative AI failures using terms like accuracy, hallucination, lying, and bullshitting. Each of these has its own merits and use cases.
Accuracy
Accuracy focuses on whether the AI's output is correct or incorrect. This is useful for tasks with clear right/wrong answers, but it doesn't work as well when users can be misled by factually accurate responses or when responses don't map directly to true or false. For example, a bot might provide an accurate but irrelevant answer. As another example, a bot might generate code that runs without errors (accurate in that sense) but contains security vulnerabilities. The biggest advantage of the accuracy framing is that it can be measured quantitatively, leading to clearer success metrics and a pathway to performance improvements.
Lying
Lying emphasizes intentional deception by the AI. This is important in scenarios where the AI is designed to mislead users, such as in the original Turing test or in disinformation campaigns. However, most generative AI systems are not programmed to intentionally deceive. Additionally, AI systems do not have consciousness, thoughts, or intentions, so attributing lying to them can be misleading, as it implies a level of agency that AI systems lack. To lie, you have to know what is true and intend to present something false instead.
Bullshitting
Bullshitting has been used at times as an alternative framing that focuses on the AI's tendency to produce plausible-sounding but ultimately meaningless or irrelevant responses. Bullshitting is an improvement on lying as a framing because it captures the idea that the AI isn't necessarily trying to deceive as a goal, just generating content that sounds good without regard for truth or relevance.
However, it can still be read to imply a level of intentionality or consciousness that AI systems lack. A human bullshitter simply does not care about what is true; there is not typically a gradual spectrum of bullshitting. In contrast, AI systems can be constrained, by design and by prompt, to stick more or less closely to relevant information. Bullshitting also can't be measured as easily as accuracy can, and it does not capture the full range of ways AI can fail users' expectations or needs.
Hallucination
Before it was applied to LLMs, hallucination was first used to describe the dream-like images AI models produced with early deep learning techniques. As an example, the "DeepDream" model generated surreal, hallucinatory images by enhancing patterns it detected in input images.

"Mona Lisa" with DeepDream effect using VGG16 network trained on ImageNet. Created by P.J. Finlay using PyTorch DeepDream code by Aleksa Gordić. Derivative work based on Leonardo da Vinci's Mona Lisa (public domain). Image source: Wikimedia Commons
Hallucination was later adopted in the context of large language models to describe when they generate information that is not real. Similar to lying and bullshitting, hallucination captures the idea that the AI is producing content that doesn't correspond to reality. While it implies less intentionality than lying or even bullshitting, it can still be a little misleading, as it suggests the AI is experiencing something akin to human hallucinations. It also treats as out of scope all the failure modes that do not involve generating factually inaccurate information.
Being fooled
Being fooled is a broader framing that encompasses many of the above concepts. You can be fooled by something that is inaccurate, a lie, bullshit, or a hallucination. You can also be fooled by something that is accurate but misleading or irrelevant. Being fooled does not imply intentionality or consciousness on the part of the AI: you can be fooled by someone with intentions, by an inanimate object, or even by yourself. Being fooled also fits well into the game-based framing of the Turing test, as it works with a wide range of success metrics.
Ways we can be fooled by generative AI
It is important to dive a bit deeper into what we mean by "being fooled" in the context of generative AI, as the original Turing test used a narrower notion of being fooled. The original test had one simple task: catch the bot. Modern AI interactions are harder because we're juggling multiple tasks simultaneously. Being fooled by generative AI isn't just about thinking a bot is human. It's about failing at any of the multiple tasks we're managing when we interact with AI.
We can think of this along two dimensions:
Failure Modes — How you get fooled
There are many ways to categorize how we might get fooled by AI. These six are ordered such that even people who know they are talking to a bot, and who have some expertise both in the domain at hand and in using AI tooling, might still fall for the later ones:
- Identity deception: Thinking the bot is human when it's not (classic Turing test scenario)
- Attribution deception: Can't tell if work was done by human alone or human + bot collaboration
- Topic expertise gap: Lacking the skill or ability to evaluate whether the bot's output is actually correct
- Capability assumptions: Wrong assumptions about what the bot can or can't do (either underestimating or overestimating based on poor prompting or outdated knowledge of model capabilities)
- Hallucination misses: Not understanding when these are more likely to occur due to context, prompt, or model limitations
- Attention/verification failures: Knowing the hallucination risk types exist but failing to check, being distracted, or skipping verification steps
Task Complexity — Cognitive load from managing AI interactions can sneak up
In the original Turing test, there was one main task: catch the bot. In many of the AI-human interactions we have today, there are multiple tasks going on simultaneously. Users know they are talking to a bot, and their main task might be writing code, looking up information, improving their writing, etc. However, at the same time they have secondary, tertiary, and even quaternary tasks going on in their heads, because they have to be concerned with preventing each of the failure modes above.
These types of AI interactions may require the user to juggle multiple tasks simultaneously, resulting in increased cognitive load.
Example games
The case studies below, each framed as a different game, highlight different combinations of game parameters.
Game 1: Social media bots—The abbreviated Turing test
Parameters varied:
- Time: Very short (1-2 sentences that can be read in seconds)
- Interaction type: One-shot or minimal back-and-forth
- Medium: Text (social media posts/comments)
- Players: One or more bots, many human readers
- Topic: Constrained to political topics or news (emotional reactions, short opinions)
- Context: Often only model and text in the thread, sometimes additional context brought in to guide text generation towards what will get engagement
- Bot Goal: Fooling (appear human enough to not be detected) & amplification (spreading a message)
- Evaluator Goal: Not actively trying to catch the bot (just scrolling)
- Awareness: Evaluator doesn't know it's a bot
- Success metric: Percentage of users who don't recognize it as a bot and are manipulated by it
The main difference between social media bots and Turing's original test is the drastically reduced time and interaction type. A 5-minute conversation provides many opportunities for a machine to reveal itself through inconsistencies, lack of cultural knowledge, or unnatural phrasing. A single comment on Facebook or a brief exchange on Twitter? That's a much easier game for a bot pretending to be a person. As a result, these bots often optimize for brevity and vagueness. They post short, emotionally engaging comments that could plausibly come from a human: "This is so true!" or "Can't believe people still think this way." They avoid extended conversation that might expose them. When they do respond, replies are generic enough to fit most contexts and add little.
Additionally, the evaluator, someone scrolling through their feed, isn't for the most part actively trying to detect bots. They're not engaged in a deliberate evaluation exercise. This transforms the test entirely. The bots "win" by changing the game, making the interaction so brief that detection becomes nearly impossible for an evaluator who isn't even fully aware they're playing the game.
A Facebook politics bot is playing a variation of the Turing test where it uses abbreviated, mostly one-way communications to shift the parameters of the game in its favor.
Game 2: Essay writing in educational settings—The non-interactive test
Parameters varied:
- Time: Not applicable (asynchronous artifact evaluation)
- Interaction type: One-shot (single artifact produced)
- Medium: Written text (static document)
- Players: Human + bot collaboration (not just bot alone)
- Topic: Academic subject matter
- Context: Model + potentially research materials
- Bot Goal: Assisting with writing/research
- Evaluator Goal: Determine if this is the student's work or if a bot was used to produce the work
- Awareness: Evaluator doesn't know if it's AI-assisted
- Success metric: Whether teacher can detect AI assistance and/or whether learning occurred
Another often discussed variation of the Turing test arises in educational settings where students use AI tools to help write essays or complete assignments, and those AI tools are not allowed.
In this scenario, the evaluator (the teacher) is not engaging in a back-and-forth conversation with the student but rather evaluating a static artifact (the essay). There's no dialogue, no opportunity to probe with follow-up questions.
The question being evaluated has also shifted. Turing's test asks: "Is this a human or a machine?" The evaluator is now asking: "Did this student write this essay themselves, or did they use AI assistance?" They might also be asking whether the student actually understands the material, which is a different question entirely. These are different questions with different thresholds for acceptable answers. Additionally, students could have used AI as a collaborator while still meeting educational goals. The essay isn't purely bot-generated or purely human-written; it's a hybrid. The student might use AI to brainstorm ideas, draft an outline, polish their prose, or check grammar. At what point does AI assistance cross from legitimate tool use to academic dishonesty? The line is blurry, and the Turing test framework wasn't designed for hybrid authorship.
An essay written with AI assistance is playing a variation of the Turing test where the evaluator must judge the presence and extent of hybrid authorship from a static artifact. The one-way, static interaction and the hybrid authorship make evaluation challenging.
Game 3: Documentation chatbots—When everyone knows it's a bot
Parameters varied:
- Time: Short interaction (minutes)
- Interaction type: Dynamic back-and-forth Q&A
- Medium: Text (chat interface)
- Players: One bot, one human user
- Topic: Constrained domain (company's documentation/policies)
- Context: Model + RAG (retrieval from documentation)
- Bot Goal: Being useful (solving user's problem)
- Evaluator Goal: Getting help (not catching the bot)
- Awareness: Evaluator knows it's a bot from the start
- Success metric: User problem solved, user satisfaction/frustration, and/or human time saved
You've clicked through to a company's help center, and a chat window pops up. "Hi! I'm here to help. What can I assist you with today?" You know immediately it's a bot. There's no deception, no attempt to pass as human. The Turing test premise, fooling someone about your nature, simply doesn't apply.
Instead, you're playing a different game entirely. The question isn't "Is this human?" but "Can this bot solve my problem faster than finding a human would?" Success is measured by utility, not believability. The bot accesses the company's documentation, retrieves relevant passages, and attempts to answer your question. If it works, great. If it fails, you try to find the documentation site or a pathway to asking a human.
The chatbot is constrained to a specific domain—the company's policies and procedures. It doesn't need to discuss philosophy or pass as human in general conversation. It just needs to be useful within a narrow scope. This reduction in scope makes the test easier for the bot.
A documentation chatbot is playing a variation of the Turing test where the evaluator knows it's a bot and the goal is utility within a constrained domain, which makes success easier to achieve.
Success often depends on how well the documentation aligns with the user's question and how effectively the bot can leverage the documentation to solve the user's problem efficiently. The latter is affected by prompt engineering, retrieval quality, and model capabilities. If relevant content is split across multiple documents or phrased differently than the user's question, the bot may struggle to piece together a coherent answer.
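As a rough illustration of that retrieval problem, here is a toy sketch of the retrieval step behind a model + RAG chatbot. It is not how any particular product works (real systems use embeddings, chunking, and re-ranking); the keyword-overlap scorer and the two example passages are made up purely to show how a phrasing mismatch can surface the wrong document.

```python
# Toy sketch of the retrieval step in a "model + RAG" documentation chatbot.
# Real systems use embeddings, chunking, and re-ranking; this keyword-overlap
# scorer and these made-up passages only illustrate how a phrasing mismatch
# between the docs and the user's question can surface the wrong passage.
def overlap_score(question: str, passage: str) -> int:
    question_words = set(question.lower().split())
    passage_words = set(passage.lower().split())
    return len(question_words & passage_words)

docs = [
    "Refunds are issued within 5 business days of the return being received.",
    "To change your plan, go back to Account Settings and pick a new plan.",
]

question = "how do I get my money back"
best_passage = max(docs, key=lambda doc: overlap_score(question, doc))

# The refund passage scores zero (it says "refunds", not "money back"), while the
# unrelated passage wins on the word "back", so the bot answers from the wrong text.
print(best_passage)
```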
User frustration can occur when the bot clearly doesn't give any useful answers and there's no easy escape hatch to manually look through documentation or contact a human. This can be especially frustrating when there used to be a human available but now it's just a cheery but entirely time-wasting bot. People can find themselves forced to play a game they never signed up for.
Game 4: Coding assistants—Speak to it like a human, but remember the many ways it is a bot
Parameters varied:
- Time: Ongoing relationship (days/weeks, not minutes)
- Interaction type: Dynamic back-and-forth over multiple sessions
- Medium: Text (code editor or chat interface)
- Players: One bot, one developer
- Topic: Code generation, debugging, technical questions
- Context: Model + full conversation history + code context
- Bot Goal: Being useful in coding tasks (responsive to natural language)
- Evaluator Goal: Doing task (getting work done efficiently) while maintaining skepticism (not getting fooled)
- Awareness: Evaluator knows it's a bot
- Success metric: Task completion, does code run, developer productivity, code quality, code security, etc.
Developers using GitHub Copilot or Claude for coding navigate a peculiar cognitive challenge. In some ways, this use of generative AI has probably been the most successful, at least in terms of revenue, adoption, productivity impact, and integration into daily workflows. The reasons for that success are many but include large amounts of training data, early integration into existing tools, built-in ways to test for first-level success (does the code run?), and capabilities for bringing in context that narrows the bot's possible outputs such that it tightly fits the user's needs.
Similar to documentation chatbots, developers know they're talking to a bot. There's no deception. The challenge is different: how to get the most useful responses while avoiding being fooled by the bot's limitations. This creates a dual mental model. The developer must:
- Communicate with the bot as if it were human (to optimize its helpfulness)
- Constantly remember it's a bot with bot flaws (to avoid being fooled)
Part of the reason that can be hard is that developers have learned that talking naturally to the AI (asking questions politely, providing context as they would to a colleague) produces better results. The AI responds more helpfully when addressed in conversational human language. Yet simultaneously, developers must always be on the lookout for the bot-specific failure modes mentioned above in the section Ways we can be fooled by generative AI.
The AI can generate plausible but incorrect code that doesn't run. It can also suggest outdated or insecure dependencies, as I discussed in my post on using instruction files to help Copilot avoid risky dependencies. It can provide confident explanations that are completely wrong. It can misunderstand your intent, especially when context is incomplete, leading to wasted time re-prompting or fixing errors. For developers using code-assistant bots, managing that additional cognitive load to avoid being fooled and even changing their workflows to minimize those errors is the cost of the increased productivity from the coding assistants.
Developers have to learn which tasks the assistant handles reliably and which require human verification. They have to develop a sense of when to trust its suggestions and when to double-check, as well as when those checks take more time than the bot saves for that task.
They also have to keep up with changing model capabilities. A task that an AI assistant might have handled terribly a few months ago might work great today. Similarly, in a coding assistant like GitHub Copilot, there are multiple models available to pick from, and some of those models might be better at certain tasks or prompts than others.
Context engineering has also become both a skill and a coding-assistant product feature. Bringing in additional context for the bot to work from, beyond the prompt users type into the chat window, can dramatically improve results. This includes other code files, related repositories, and external documentation. It also includes pre-written guides in agent.md files or instruction files (as described in a previous blog post) that help guide the bot's behavior when doing specific tasks. Bringing in these additional context sources helps reduce hallucinations and improve accuracy because it limits the bot's possible outputs to those that are more likely to be correct.
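To sketch what that looks like mechanically, here is a simplified, hypothetical example of assembling a prompt from an instruction file and a few relevant source files before it is sent to a model. The file paths and the helper function are illustrative assumptions, not any vendor's actual API; real coding assistants do this assembly (plus retrieval and truncation) behind the scenes.

```python
from pathlib import Path

# Simplified sketch of context engineering for a coding assistant: the prompt sent
# to the model is assembled from project instructions, relevant source files, and
# the user's request, which narrows the space of plausible outputs. The file paths
# and this helper are hypothetical, not any specific product's API or format.
def build_prompt(user_request: str, repo_root: str, relevant_files: list[str]) -> str:
    parts = []
    instructions = Path(repo_root) / ".github" / "copilot-instructions.md"
    if instructions.exists():
        parts.append("# Project instructions\n" + instructions.read_text())
    for rel_path in relevant_files:
        source = Path(repo_root) / rel_path
        if source.exists():
            parts.append(f"# File: {rel_path}\n" + source.read_text())
    parts.append("# Request\n" + user_request)
    return "\n\n".join(parts)

# Grounding the request in the project's own dependency list and client code makes
# outdated or risky dependency suggestions less likely.
prompt = build_prompt(
    "Add retry logic to the HTTP client without adding new dependencies",
    repo_root=".",
    relevant_files=["requirements.txt", "src/http_client.py"],
)
```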
Using generative AI coding assistants is playing a variation of the Turing test where the developer must help the bot be as human as possible to get the best results, while still remembering, in the back of their head, all the different ways generative AI bots can fail.
Discussion
This blog post was an experiment to see how well the Turing test framework could be adapted to help think about modern generative AI interactions.
Weaknesses of this conceptual framework
Mapping real-world scenarios onto the parameter space takes too much effort
Chief among the limitations is that it takes some thought to map real-world scenarios onto the parameter space. In comparison, consider this other conceptual model that only takes a sentence to explain:
"How would the world react if hammers were invented only a few years ago."
People can immediately start to apply that mental model to different aspects of AI and society. For example, there would be more than a few social media posts about how people had hit themselves in the face with a hammer and therefore hammers were dumb and pointless. As another example, there would be a lot of debate about hammers' economic and labor implications. How would we know if we're in a hammer bubble? There might also be a steady evolution of hammer design and associated tooling as people figured out what worked and what didn't.
Not everything that matters is within the AI-user interaction scope that is used to define the game play
This conceptual framework's focus on "being fooled" in a game-dynamics context effectively treats as out of scope many important aspects of human-AI interaction. The framework assumes rational evaluators, whereas in reality, human biases, perceptions, and emotions can significantly impact both user interactions with AI and perceptions of success or failure. The focus on a singular "game" can also oversimplify complex interactions that involve the aggregate effects of many games being played. For example, which tasks are targeted may change as different tasks become easier or harder with AI assistance. As another example, if certain coding tasks become easier for non-experts, that may shift who is doing those tasks, with knock-on effects on team composition, security practices, workflows, required skills, etc. Taking this idea further, when millions of games are played, the aggregate effects on job markets, the environment, and economic structures could be significant.
This conceptual model is somewhat myopically focused on one user and their immediate task at hand.
Benefits of this conceptual framework
The main benefit of this conceptual framework is that it helps organize thinking about different generative AI-user interactions as variations of the same fundamental game rather than as isolated scenarios. Secondly, the focus on "being fooled" helps capture a broad range of failure modes that go beyond simple accuracy or hallucination concerns.
Conclusion
We're not playing the original Turing test in 2026; we're playing many slightly different Turing-test-like games. Each variation requires different expectations and strategies. A social media disinformation bot, a documentation chatbot, and a coding assistant are all "AI," but they present fundamentally different challenges and game dynamics. Understanding which variation you're in can help you, as the user, design better strategies for playing each game effectively.
*Generative AI was used to help write this blog post, specifically for exploring alternative structures and phrasings.