Michael Demidenko

The results don’t live up to the hype.

Disclosure: I’m not an unwavering advocate for LLMs. They offer advantages but also have drawbacks, and I use them on a case-by-case basis. Finally, the below is not meant to discredit the time and energy the authors put into this work; it is a beefy paper with many analyses.

Your Brain on ChatGPT Preprint

The Bottom Line

Key Takeaway: The viral MIT study claiming AI causes “cognitive debt” suffers from severe methodological flaws that make its sweeping conclusions scientifically unjustified. The “brain proof” that AI makes us dumber simply isn’t there.

The Claims vs. The Evidence

The article from MIT “Your Brain on ChatGPT: Accumulation of Cognitive Debt when Using an AI Assistant for Essay Writing Task” has received massive attention across social media platforms, with ResearchGate showing over 24,000 reads.

Preprint’s Main Takeaway: “We believe that some of the most striking observations in our study stem from Session 4 and the pressing matter of a likely decrease in learning skills based on the results of our study.”

The social media narrative is that this paper proves AI is making our brains “dumber” and causing “long-term cognitive debt” (Psychology Today). The brain data are supposedly proof! But are these dramatic claims actually supported by the evidence?

The Reality? There are fundamental flaws in the study design and measurements that undermine these claims. These problems include, but are not limited to:

  • Small, biased samples: Only 18 per group in main sessions, just 9 per group in the crucial Session 4 (see small samples issue in neuroscience)
  • Self-selection bias: The key findings come from participants who voluntarily returned for an “optional” 4th session (i.e., not part of the original experimental design)
  • Flawed brain interpretations: Assumes higher neural connectivity = better cognition, when it could equally mean inefficiency or greater effort (see ‘reverse inference problem’ in Poldrack, 2006)
  • Artificial task with no real-world validity: Participants wrote SAT essays with no learning targets, nothing like actual LLM usage. 
  • Poor experimental control: Participants could deviate from their assigned tools however they wanted (e.g., someone in the LLM group could use it only for proofreading), creating a high degree of heterogeneity

The broad generalizations about “cognitive debt” are scientifically unjustified. The main takeaway, especially from Session 4, cannot be supported with confidence. Just like a previous study didn’t actually “prove that your brain needs breaks” with EEG results, this paper has interesting data, but its grand conclusions are not supported by the evidence.

What the rest of the post covers.

As an expert in Statistical Analyses, Cognitive Neuroscience, Measurement, Study/Experimental Design and Research Methods, I’ve read the entire paper to give you the real story. This detailed evaluation breaks down:

  1. Study Design: The basic experimental setup and why it matters for interpreting results
  2. Stated Goals: What the researchers claimed they were testing
  3. Notable Issues: The specific methodological and conceptual problems that undermine the conclusions

Throughout, I’ll use direct quotes and page references from the original paper to show you exactly where the problems lie.




Study Design

Three Groups
  • LLM group: wrote essays using an LLM tool
  • Search Engine group: wrote essays using a search engine
  • Brain-only group: wrote essays with no tools
    • Brain-only instruction: “use your own knowledge”
  • Essay prompts: SAT-style; 3 topics per session, 9 total

Sessions

[Figure: illustration of participants across sessions]
  • Sessions 1–3 were required to complete the experiment
    • Group assignments stayed constant through sessions 1–3
  • Session 4 was optional, dependent on availability/scheduling
    “Participants were assigned to the same group for the duration of sessions 1, 2, 3 but in case they decided to come back for session 4, they were reassigned to another group.”
    • LLM group → Brain-only: no tool allowed
    • Brain-only group → LLM: given the LLM tool
    • Use of tools was restricted based on assignment (pg. 27)
Participants
  • Recruited for sessions 1–3: 54
  • Returned for Session 4: 18
[Figure 3 from the Kosmyna et al. preprint: participant demographics]
EEG Constructs
  • Cognitive engagement
  • Cognitive load
  • Bands: Alpha, Theta, Beta
Essay Scoring
  • Human teachers
  • AI agent “judge”

 

Goal of Work

  • Explores the cognitive cost of using an LLM while performing the task of writing an essay
  • Essay rationale: “Cognitively complex task that engages multiple mental processes while being used as a common tool in schools and in standardized tests of a student’s skills”
  • Evaluates the concern that “emerging research raises critical concerns about the cognitive implications of extensive LLM usage… diminish critical thinking capabilities and lead to decreased engagement in deep analytical processes” (pg. 10)

Key Research Questions (pg. 11)

  • Do participants write significantly different essays when using LLMs, search engines, or their brain only?
    • tl;dr: Somewhat supported (more below)
  • How does participants’ brain activity differ when using LLMs, search engines, or their brain only?
    • tl;dr: Many analyses. Likely spurious and post-hoc interpretations.
  • How does using an LLM impact participants’ memory?
    • tl;dr: Relatively stable; the LLM and Search Engine groups underperform somewhat in sessions 1–3 on “Ability to Quote”.
  • Does LLM usage impact ownership of the essays?
    • tl;dr: Somewhat.

Observations and Concerns

Odd Positioning of Literature Review
“Search as Learning” Framework
  • The framework highlights how web searches can serve as “powerful educational tools” when approached strategically
    • SAL emphasizes the “learning aspect of exploratory search with the intent of understanding” (pg. 13)
  • This takes what was originally perceived, when search engines were introduced, as “bad” and spins it into a positive. A similar argument can be made for sophisticated and thoughtful prompt engineering: users can get more out of LLMs with better prompts and clearer requests (cf. the “Google effect”)
    • “Users must engage in iterative query formulation, critical evaluation of search results, and integration of multimodal resources” (pg. 13)
  • Assumption issue: This assumes users engage thoughtfully and do not simply pick a top-ranked source while getting distracted by “social media” within the search engine. Thus, search users may incur biases similar to those from LLMs. Conversely, if users develop skills to probe further via prompts, that would be equivalent to scanning beyond the first article.
Cognitive Load and Web Searches
  • “During query formulation, users must recall specific terms and concepts, engaging heavily with working memory and long-term memory to construct queries that yield relevant results” (pg. 14)
  • This still holds true for LLMs, and it is user-dependent
  • Relies on Google fact-checking
Physiological Responses
  • “Through fMRI, it was found that experienced web users, or ‘Net Savvy’ individuals, engage significantly broader neural networks compared to those less experienced, the ‘Net Naïve’ group [51]… This broader activation is attributed to the active nature of web searches” (pg. 17)
    • The cited study is an exploratory, whole-brain fMRI analysis showing more activity in “Net Savvy” older adults than in “Net Naïve” older adults. It asked: are there fundamental differences between users and non-users of the internet when age and education are similar?
    • Interpretation issue: More activity doesn’t necessarily mean “better.” Activity is computed via a contrast, so it represents a difference in activity related to an evoked response. More activity can sometimes mean “less energy required” or “more energy required”; there is no simple mapping onto a good/bad distinction here.
    • The cited study was also small (12 vs. 12 adults around age 65) and had design issues: the active-versus-passive design does not isolate the critical process the original authors were interested in
  • The authors heavily focus on dorsal ACC findings. 
    • Arguments that the dACC is selective for some processes over others have been challenged by Tal Yarkoni (here), and false positives in spatial hypotheses about the ACC are noted in Hong 2019 (here). This makes it difficult to rule out false positives and to assess the replicability of such findings.
  • In any case, the theoretical premise for why the current design should find evidence alluding to deficiencies in brain function appears cherry-picked.
Methodological Concerns
Poor Group Assignment
  • How were individuals assigned to groups?
  • Almost twice as many females as males were recruited. Despite the claim that participants “were randomly assigned across the three following groups, balanced with respect to age and gender” (pg. 23), the sample remains imbalanced, with more females and mostly undergraduates (a minimal sketch of stratified assignment follows this list)
  • What is the demographic distribution in Session 4, and does “availability” bias the conclusions?
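To illustrate the point about balance, below is a minimal sketch in Python of stratified random assignment, which enforces gender and age balance across the three groups by construction rather than hoping simple randomization achieves it. The participant data and column names here are made up for illustration; this is not the authors’ procedure.

# Minimal sketch (not the authors' procedure): stratified random assignment
# to three groups so gender and age bins are balanced by construction.
# Participant data and column names are hypothetical.
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)
participants = pd.DataFrame({
    "id": range(54),
    "gender": rng.choice(["F", "M"], size=54, p=[0.65, 0.35]),
    "age_bin": rng.choice(["18-24", "25-39"], size=54, p=[0.8, 0.2]),
})

groups = ["LLM", "Search", "Brain-only"]

def stratified_assign(df: pd.DataFrame) -> pd.Series:
    """Shuffle within each gender x age stratum, then deal participants
    round-robin into the three groups."""
    assignment = pd.Series(index=df.index, dtype=object)
    for _, stratum in df.groupby(["gender", "age_bin"]):
        shuffled = stratum.sample(frac=1, random_state=0).index
        for i, idx in enumerate(shuffled):
            assignment[idx] = groups[i % len(groups)]
    return assignment

participants["group"] = stratified_assign(participants)
print(pd.crosstab(participants["group"], participants["gender"]))

The crosstab at the end makes the resulting balance auditable, something the preprint does not report for Session 4.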
Prior to Study Use: Patterns of LLM use*
  • Figure 30 shows “no response” at 100%, “daily” at 20%, “2–3x a week” at 15%, and “from time to time” at 10% for the rest… it is hard to understand why “no response” is 100%
  • Figure 32 shows that when people switch groups they use GPT differently, so patterns of adoption and self-selection can bias the preceding results
Recruitment and Attrition*
  • Why 54 participants, and how were they recruited? Why 4 sessions over 4 months, and who dropped out and why?
    • While it is stated that recruitment occurred over 4 months, did participants come back at consistent intervals?
  • Sixty participants were recruited; due to scheduling problems, 55 completed the experiment in full, but data are reported for only 54 across the three sessions. What happened to the 55th participant?
  • Wide age range: 18 to 39, with 35 undergrads, 14 postgrads, and 6 masters/PhDs (Figure 2)
  • Northeastern, Tufts, Harvard: where most males were recruited; the overall sample is mostly female. Unclear imbalance/poor sampling
Posthoc Session Structure
  • If the experiment was sessions 1–3, why was an “optional” 4th session analyzed?
    • “Each participant attended three recording sessions, with an option of attending the fourth session based on participant’s availability. The experiment was considered complete for a participant when three first sessions were attended. Session 4 was considered an extra session” (pg. 23)
  • Given the risk of self-selection bias, the timing, and the sample size (9 per group in Session 4), is it appropriate to interpret these findings as the most important in the summary statement? (A minimal power sketch follows this list.)
    • If anything, Session 4 results should have only been mentioned in passing in the discussion or excluded entirely. Instead, the authors LEAD with these findings in their summary.
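For context on why 9 per group is so limiting, here is a minimal power sketch using statsmodels under generic two-sample t-test assumptions (alpha = .05, 80% power). These are illustrative calculations, not the paper’s actual analyses or effect sizes.

# Minimal sketch of why n = 9 per group is a problem: under standard
# two-sided t-test assumptions (alpha = .05, 80% power), what effect size
# is even detectable? Illustrative only, not taken from the paper.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

for n_per_group in (9, 18):
    d = analysis.solve_power(effect_size=None, nobs1=n_per_group,
                             alpha=0.05, power=0.80, ratio=1.0,
                             alternative="two-sided")
    print(f"n={n_per_group}/group -> minimum detectable Cohen's d ~ {d:.2f}")

# Power to detect a 'medium' effect (d = 0.5) with 9 per group
power = analysis.solve_power(effect_size=0.5, nobs1=9, alpha=0.05,
                             ratio=1.0, alternative="two-sided")
print(f"Power to detect d=0.5 with n=9/group ~ {power:.2f}")

Running this shows that only very large group differences are reliably detectable at these sample sizes, which is exactly why the Session 4 results deserve heavy caveats.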
Potential Learning Effects
  • Does the Brain-only group have a learning effect that carries over into Session 4?
  • While the paper states that participants were forbidden from using search engines and LLMs during the study depending on their group assignment, there is no evidence of how much participants did or did not use these tools outside of the sessions. For example, suppose I am studying the effects of exercise on cardiovascular health over six months. My groups during the study either 1) do or 2) do not engage in exercise, but participants can exercise outside of the protocol between sessions. What can my experiment say about the effects of exercise on cardiovascular health?
Task Time Constraints*
  • Speeded completion: After calibration, the Stage 4 writing task lasted only 20 minutes.
  • The EEG session was short, so strict time constraints added extraneous variables of stress and pressure.
  • Some participants self-reported this in the debriefing interview:
    • Time pressure occasionally drove continued use: ‘I went back to using ChatGPT because I didn’t have enough time, but I feel guilty about it‘” (pg. 38)
Memory/Quoting Results  
[Figure: session effects, based on data reported in the Kosmyna et al. preprint]

The lack of learning targets for participants is made clear in the paper, which helps explain why quoting was poor in Session 1; at follow-up, participants understood the question might arise, so in Session 2:

  • ABILITY TO QUOTE: “Unlike Session 1, where the quoting question might have caught the participants off-guard, as they heard it for the first time (as the rest of the questions), in this session most participants from all the groups indicated to be able to provide a quote from their essay.” Quoting ability was reported as 18/18 for Brain-only, while for the LLM and Search Engine groups the paper instead reports that 2/18 had challenges quoting (note the flip in reporting direction to make it appear worse; clever!)
  • CORRECT QUOTE: The LLM group had 4 participants who could not quote correctly, while 2 in each of “group 2”/“group 3” were not able to provide a quote (note the switch to numbered group labels here, which makes the results harder to track due to inconsistent labeling)
  • Quoting performance seems to improve after Session 1 and stays constant across Sessions 2–3, with 4–6 LLM users unable to provide correct quotes versus 2 from each of the other two groups

Session 4

  • The distinct shift in quoting ability in the reassigned groups is uninterpretable because performance was relatively consistent between groups beforehand; participants may simply have become more familiar with the task and question, which could drive the change. If anything, Session 4 indicates that going from the LLM to the Brain-only group does not produce a deficiency in thinking, as participants adjusted and still quoted relatively well
  • Interestingly, reassignment from the Search Engine group to the other two groups isn’t clearly reported in the figures (pg. 38–39)

Multi-staged Consolidation in Memory

  • Brain-only: users have to come up with content, write it, and read it
  • Search Engine: users have to search for and find content, write, and reread
  • LLM: users can get a quote, paste it into the text, and read it

Cognitive Debt/Loading or Different Strategies?

Neural Activity Interpretation

[I have worked with EEG, but my expertise is in fMRI/MRI. Nevertheless, the cross-field concepts and methodological issues hold; still, there may be things I missed in my quick review.]

 

[Figure: EEG graphic from the Kosmyna et al. preprint]

EEG Results

  • Different neural connectivity patterns were observed across several regions in the alpha, beta, and theta bands.
  • Brain connectivity systematically scaled down with the amount of external support.
  • In Session 4, Brain-to-LLM participants showed higher neural connectivity than the LLM group in sessions 1, 2, and 3 (a network-wide spike in alpha-, beta-, theta-, and delta-band directed connectivity).
  • The dDTF measure was computed for all sessions, including the self-selected Session 4 (a sketch of what “directed” connectivity estimation involves follows below).
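For readers unfamiliar with directed connectivity, the sketch below is my own illustration with simulated data, using Granger causality rather than the paper’s dDTF pipeline. It shows what “directed” estimates amount to: one channel’s past improving prediction of another’s present. That is covariation over time, not evidence of a causal brain circuit.

# Minimal sketch (my illustration, not the paper's dDTF pipeline):
# "directed" connectivity in the Granger sense asks whether the past of
# channel B improves prediction of channel A. Data are simulated.
import numpy as np
from statsmodels.tsa.stattools import grangercausalitytests

rng = np.random.default_rng(1)
n = 1000
b = rng.standard_normal(n)                         # simulated "channel B"
a = np.zeros(n)                                    # simulated "channel A"
for t in range(1, n):
    a[t] = 0.5 * b[t - 1] + rng.standard_normal()  # A lags B by one sample

# Test whether column 2 (B) Granger-"causes" column 1 (A)
data = np.column_stack([a, b])
results = grangercausalitytests(data, maxlag=2)
f_stat, p_value, _, _ = results[1][0]["ssr_ftest"]
print(f"Lag 1: F = {f_stat:.1f}, p = {p_value:.3g}")
# A significant test only says B's past helps predict A; the same pattern
# can arise from a common driver, volume conduction, or filtering artifacts.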

Concerns

  • The interpretation commits the classic issue of reverse inference, further complicated by treating what is largely covariation between electrodes within a frequency band as a “causal” structure (see the Granger sketch above).
  • Why is MORE neural activity associated with BETTER performance? Higher and/or lower connectivity does not directly mean better or worse; it simply means different. The authors could just as easily have argued that less activity reflects MORE efficiency (similar issues exist in fMRI, where researchers cherry-pick when greater activity is “better” and when less activity is “worse”).
  • Does a difference of differences address the core question? Why would we be surprised that there are differences in how people engage with different types of tasks during each session?
  • It seems the repeated-measures ANOVA controlled the error rate (FDR) only for the between-group comparisons, not across the many between-group combinations (a minimal FDR sketch follows this list).
  • This EEG method relates to fMRI resting-state analyses in several ways. dDTF uses a multivariate analysis that attempts to recover the causal structure of activity between regions/locations in the frequency domain for EEG (which would be the time domain in fMRI). Conceptually, this is related to GIMME in fMRI, which uses vector autoregression across windows of time to estimate directed connections between regions, and, in multiple-solution GIMME (this study), directed multi-path graphs via Granger causality (see a previous analysis by Demidenko et al. using GIMME here). Unlike dDTF, which relies on global fit metrics such as AIC, GIMME focuses on model-building statistics such as RMSEA, CFI, TLI, and SRMR; the two rely on distinct model-fit statistics, and in model-fitting approaches similar fit statistics are possible for numerous competing models.
  • While the authors claim “Alpha band connectivity is often associated with internal attention and semantic processing” based on prior literature, this is not confirmed via questionnaires in their own data. Given this is not a registered report (Chambers & Tzavella, 2021), the analysis is susceptible to many p-hacking strategies (Stefan & Schönbrodt, 2023).
  • The authors’ statement that “The higher alpha connectivity in the Brain-only group suggests that writing without assistance most likely induced greater internally driven processing, consistent with the idea that these participants had to generate and combine ideas from memory without external cues” (pg. 78) is further complicated by the fact that how much the LLM group actually used the tools at their disposal was not controlled.
  • Similar concerns apply across the other bands: interpreting lower/higher connectivity assumes the fundamental gains and does not control for within-group variability in tool adherence.
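As a concrete illustration of the FDR point above, here is a minimal sketch of applying Benjamini-Hochberg across all band-by-group-pair comparisons at once, rather than within each comparison separately. The p-values and band/group labels are simulated placeholders, not the paper’s data.

# Minimal sketch (illustrative, not the paper's analysis): Benjamini-Hochberg
# FDR applied across *all* band-by-group-pair comparisons jointly.
import numpy as np
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(2)
bands = ["alpha", "beta", "theta", "delta"]
group_pairs = [("LLM", "Search"), ("LLM", "Brain-only"), ("Search", "Brain-only")]

# One p-value per band x group-pair comparison (simulated here)
labels = [f"{b}:{g1}-vs-{g2}" for b in bands for g1, g2 in group_pairs]
p_values = rng.uniform(0, 0.10, size=len(labels))

reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")
for label, p_raw, p_adj, sig in zip(labels, p_values, p_adjusted, reject):
    print(f"{label:28s} raw p={p_raw:.3f} adj p={p_adj:.3f} significant={sig}")

Correcting within each comparison separately leaves the family-wise false discovery rate across the full grid of tests uncontrolled, which is the concern raised above.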
Validity of Task 
  • The essay-writing task may not reflect how users actually use GPT. Depending on the individual, they may be comfortable answering SAT-style questions and so will engage minimally with the resource.
  • The task was simple: write an essay using the ASSIGNED method, which was random rather than chosen. There is no target beyond completion. This lacks the predictive validity and construct validity critical to real-world scenarios/use (for a crash course in validating measures, see Slaney 2017 and the Standards for Educational and Psychological Testing).
  • There are no targets for learning or recall. Similar issues exist in non-GPT scenarios in the real world, such as test-taking where users cram to meet target scores. Hence, the predictive validity of test scores for real-world performance can be poor (e.g., this study focused on the predictive utility of the GRE)
Task Adherence: LLM Group Deviations*
  • In LLM group: “another preferred ‘the Internet over ChatGPT to find sources and evidence as it is not reliable’ (P13)” (pg. 38)
  • It is unclear how many others did not rely on the tool but did not report it.
    • Evaluation of GPT consultation should have used some estimation score to rank participants’ use of the assigned tool, plus the semantic convergence between the range of GPT-produced content and the content of the submitted essays (a rough sketch of one such similarity check follows this list).
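One rough way to operationalize that suggestion (my sketch, not anything in the paper; the example strings are placeholders) is a simple TF-IDF cosine similarity between what the LLM produced for a participant and what they actually submitted.

# Rough sketch (my suggestion, not in the paper): estimate how much a
# submitted essay converges with the LLM output a participant received,
# using TF-IDF cosine similarity. Strings are stand-in examples.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

llm_output = "Art and philanthropy both shape communities, but giving time ..."
submitted_essay = "Communities are shaped by both art and acts of giving ..."

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform([llm_output, submitted_essay])

similarity = cosine_similarity(tfidf[0], tfidf[1])[0, 0]
print(f"LLM-to-essay similarity: {similarity:.2f}")  # higher -> heavier reliance (crudely)

A score like this, computed per participant, would at least give a graded estimate of tool reliance instead of treating group assignment as adherence.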
Bias in Writing
  • There is a more positive tone toward search engines than LLMs, rather than balancing the pros and cons equally. For example, search engines are optimized to rank content based on relevance scores, authority rankings, user personalization, and other engagement information. Thus, page 1 may contain biased content; does that make it superior to an LLM, which summarizes based on a ranking that uses probability weights across billions of parameters?
  • Pg. 30, other comments: the authors include the pros people report for Brain-only and the negatives for the LLM group

Conclusion

While the MIT “Brain on ChatGPT” preprint presents an intriguing set of data on EEG activity during LLM-assisted writing, the methodological flaws undermine the broad conclusions that have gained traction across social media. The reliance on a small, self-selected sample for the key Session 4 findings, combined with questionable assumptions about EEG connectivity and unclear LLM usage, suggests these results may be contaminated by other features of the measurements and design. In a way, this research demonstrates the dangers of drawing broad conclusions from exploratory studies with significant design limitations and high public interest.
