Have memory studies shown the Gospels are likely to be unreliable?

Bart Ehrman has been a leading figure in claiming that psychological research on memory challenges the reliability of the Gospels. Historians on both sides of the debate have provided insightful comments. However, as far as I’m aware, there has not yet been a critical evaluation of Ehrman’s work by psychologists.

As a psychologist, I hope I can help fill this gap. I am not an expert on memory research. However, I am an expert in critically evaluating psychological research and the applications of its findings.

I’ve based my response on the debate between Bart Ehrman and Richard Bauckham. I think Ehrman makes two main arguments from the findings of psychological research:

  • Human memory is fundamentally unreliable: it is therefore highly uncertain that eyewitnesses of the events would accurately recall what happened.
  • We often get confused between the events that happened and the things we have imagined. Therefore, it is extremely difficult to tell when our memories reflect what happened or are just a product of our imagination.

Before tackling these two main points, I want to make some general observations about some of the pitfalls Ehrman falls into when summarising this large complex field of memory research. I also try to map a way out of these pitfalls – borrowing from the principles of systematic reviews and evidence synthesis. These methods are developing rapidly in medical research and have influenced many other disciplines including psychology and biology.

The problem of selectively highlighting evidence that supports his case

Ehrman covers several topics within memory research, such as ‘flashbulb’ memories and false memories, which together span many dozens of studies conducted over several decades. To his credit, he hasn’t taken this lightly, reading hundreds of studies over several years. However, I think there are some important limitations in how Ehrman synthesises the available literature.

Creative Commons license CC-BY-NC-ND

The cartoon above by Hilda Bastian, an expert in synthesising medical evidence, satirizes the common distortions that often occur even when experts try to summarise the findings of studies on a particular topic:

“Unfortunately, little Suzy isn’t the only one falling for the temptation to dismiss or explain away inconvenient performance data. Healthcare [perhaps even more so in cognitive psychology] is riddled with this, as people pick and choose studies that are easy to find or that prove their points. In fact, most reviews of healthcare evidence [likely more so in cognitive psychology] don’t go through the painstaking processes needed to systematically minimize bias and show a fair picture.”

Hilda Bastian https://statistically-funny.blogspot.com/2013/04/look-ma-straight-as.html

She goes on to provide a nice summary of what a systematic review is and the principles for overcoming these problems:

 “A fully systematic review very specifically lays out a question and how it’s going to be answered. Then the researchers stick to that study plan, no matter how welcome or unwelcome the results. They go to great lengths to find the studies that have looked at their question, and they analyze the quality and meaning of what they find.”

Hilda Bastian

One of the key points is that selective citation of studies that support our view leads to biased conclusions. To tackle this issue I’ll aim to present more of the context of the literature with the help of systematic reviews and other studies where available.

Failing to critically analyse evidence often leads to misleading conclusions

Ehrman often discusses psychological studies without sufficient nuance or critical analysis.

How do we tackle this problem? We need a transparent way of evaluating how certain we should be that the study’s findings reflect the real world.

I will use GRADE (Grading of Recommendations, Assessment, Development and Evaluations) categories to do this. This approach is the standard method for evaluating the quality of evidence across medicine. GRADE has also been widely applied to evaluate research in psychology and other disciplines such as ecology.

GRADE has five main domains leading to a possible rating of High, Moderate, Low or Very Low:

  • Study limitations: have the studies minimized bias (i.e. errors that lead to misleading findings)? If so, can we trust that their findings reflect the truth?
  • Inconsistency: do the studies all come to the same conclusion or do they differ substantially?
  • Imprecision: is there enough evidence for us to draw a conclusion?
  • Indirectness: are we confident the studies can be applied to the question (for us the reliability of the disciples’ memories)?
  • Publication bias: It’s well established that studies with less exciting findings are less likely to be published. This results in a distortion of the overall evidence. Is there reason to suspect this may have impacted on findings?

For the GRADE approach, certainty begins at High, but concerns in any of these domains result in a downgrade by one level (e.g. to Moderate) or, if the concerns are very serious, by two levels (e.g. to Low).
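To make the downgrade rule concrete, here is a minimal Python sketch of the simplified version described in this post. The function name `grade_rating`, the numeric coding of concerns (0, 1 or 2 levels) and the floor at Very Low are my own assumptions for illustration. A real GRADE assessment rests on judgement in each domain rather than a mechanical tally.

```python
# Illustrative sketch only: a simplified tally of the downgrade rule
# described in this post, not an official GRADE tool.
LEVELS = ["High", "Moderate", "Low", "Very Low"]

DOMAINS = (
    "study limitations",
    "inconsistency",
    "imprecision",
    "indirectness",
    "publication bias",
)


def grade_rating(downgrades):
    """Return the certainty level after applying per-domain downgrades.

    `downgrades` maps a domain name to 0 (no concern), 1 (serious concern)
    or 2 (very serious concern). Domains that are not listed count as 0.
    This coding is an assumption made for the example.
    """
    total = 0
    for domain, steps in downgrades.items():
        if domain not in DOMAINS:
            raise ValueError(f"unknown domain: {domain}")
        if steps not in (0, 1, 2):
            raise ValueError(f"downgrade must be 0, 1 or 2, got {steps}")
        total += steps
    # Certainty starts at High (index 0) and cannot fall below Very Low.
    return LEVELS[min(total, len(LEVELS) - 1)]


print(grade_rating({}))                          # no concerns
print(grade_rating({"study limitations": 1}))    # one serious concern
print(grade_rating({"inconsistency": 2}))        # one very serious concern
```

With no concerns the sketch returns High, a single serious concern gives Moderate, and a single very serious concern gives Low.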

Issue 1: Ehrman on the unreliability of memory: Challenger Disaster study

With that background, we are now able to assess Ehrman’s use of psychological evidence.

He uses a study looking at university students’ memory of the Challenger disaster. He tells the story engagingly: the students’ later responses are often confused and many can’t even remember filling in a questionnaire at the time of the disaster. He concludes that psychologists have shown we cannot trust our memories.

Ehrman is quoting a classic study by Neisser and Harsch (1992). But as I’ve argued above, it can be misleading to consider an individual study in isolation. There are numerous studies on ‘flashbulb memories’ (memories of unexpected events with high emotion attached to them). So I’ll look at how this study sits within the context of several similar studies.

Rating the quality of evidence: Very Low

The only systematic review I could find on this topic was van Giezen et al (2005). This review is a little out of date but includes 18 studies on flashbulb memories similar to Neisser and Harsch (1992). I will use this paper to provide context on the wider body of evidence.

Study limitations

Van Giezen et al (2005) rated the quality of individual studies as ‘relatively low’. For example, the Neisser and Harsch (1992) study gets a score of 2 out of 5 for study quality. This indicates a weak set of studies.


Inconsistency

There is considerable inconsistency in findings. Van Giezen et al (2005) found reasonable accuracy in recall over time in some studies but a decline in accuracy in others – studies after 2005 show a similar pattern.

To illustrate this inconsistency, let’s look at a later study by Neisser (1996), who aimed to understand his students’ poor recall of the Challenger disaster. Neisser investigated whether people more directly affected by an event had more accurate memories. He found that Californian students had near-perfect accuracy (96-99%) when recalling, 18 months later, events related to the 1989 Loma Prieta earthquake in California. However, Atlanta students showed much lower accuracy (55%) at 18 months, similar to the earlier study quoted by Ehrman.

The inconsistency of findings across studies and in subgroups within studies means we cannot reliably apply the findings of Neisser and Harsch (1992) to make general statements about the unreliability of memory. There is clear evidence that findings on the accuracy of memory may differ depending on who and what event is being targeted in the study (we’ll look at this in more detail in the next section).


Imprecision

There are many studies on ‘flashbulb memories’. Therefore we have a lot of data to inform conclusions.


Indirectness

When thinking about indirectness, we have to think about what we are using the evidence to inform – in this case the reliability of the Bible. As noted by Van Giezen et al (2005), the data are focused largely on students recalling events they may or may not have heard on the news. It is questionable if these data can inform the recollections of people witnessing the events of Jesus’ life.

Publication bias

The nature of the data makes it challenging to assess publication bias reliably – and so far analyses have not sufficiently investigated this possibility.


This is not a formal GRADE assessment but we have used the GRADE approach as a framework for evaluating studies on ‘flashbulb memory’. We have identified problems with study limitations, inconsistency of evidence, and indirectness. Based on this assessment, I’ve rated the quality of evidence very low.

This is in contrast with Ehrman’s claim that such memories have been shown conclusively to be unreliable. It’s inappropriate to draw strong conclusions given the problems with the data.

Does the accuracy of memory differ depending on how involved you are with events?

I think quite rightly, Bauckham responds to the Challenger disaster data by pointing out:

  • This shows we can and do get mixed up recalling events happening in the past – our memories aren’t 100% accurate
  • There are a variety of factors that influence how well we remember events: involvement with the event, emotions related to events, and whether it was surprising.

Ehrman counters by suggesting Bauckham is appealing to lay intuitions. It’s an impressive-sounding rhetorical flourish. However, I think it’s misleading, as the factors that affect the reliability of memory have been studied extensively.

Studies on the accuracy of memory for people directly experiencing events

I’m going to expand on the finding above from Neisser (1996), where we’ve already seen that Californian students had a very accurate memory of the Loma Prieta earthquake, while students more remote from the location had much poorer recall.

Marmara, Turkey

A similar study replicated the findings of Neisser (1996) – investigating memory of the Marmara earthquake in Turkey. But this time, Er (2003) studied a bigger (665 people) and more representative (i.e. not just students) sample of people.

She compared those who lived in Marmara (and directly experienced the earthquake) with people who lived in Mersin (who had not been affected by the earthquake). Six months later, participants who directly experienced the earthquake had near-perfect accuracy in recalling most events while recall was substantially less accurate in participants who did not directly experience the event.

Are these highly accurate memories just for 6 months or 18 months? What about memory over much longer periods?

It’s very difficult to study the accuracy of memory over many decades, so few studies look into this. Berntsen and Thomsen (2005) assessed memory for the weather and activities carried out on exact dates (e.g. invasion and liberation in Denmark) during World War II, 63-68 years after the events. People who lived through the events were compared with members of the psychology faculty who did not live through the war.

| Question | Danes who lived through the war | Danes who were not yet alive at the time |
|---|---|---|
| Weather (day of invasion/liberation) | 67-69% correct; 14-16% no answer; 8% incorrect; 8-11% mixed | 5% correct; 82-86% no answer; 3-5% incorrect; 6-11% mixed |
| Sunday or workday | 78-86% correct; 15-20% no answer; 1-2% incorrect | 26-28% correct; 62-66% no answer; 8-11% incorrect |
| Day of the week | 23-28% correct; 63-64% no answer; 10-13% incorrect | 3-5% correct; 86-89% no answer; 11% incorrect |
| Exact date of dark shades demand | 43% correct; 33% no answer; 24% incorrect | 5% correct; 85% no answer; 11% incorrect |

The contrast between groups is compelling:

  • Those who lived through the war recalled events with surprising accuracy – the majority correctly recalled the weather on the day of invasion and liberation of Denmark from the Nazis, and whether these events happened on a Sunday or workday.
  • Those who lived through the war had substantially more correct responses for all questions compared to the psychology faculty who did not directly experience the war.
  • Older Danes’ recall was lower for more specific details: recalling the actual day of the week of the invasion and liberation, and also the date that the invading Germans required households to draw their curtains. But it was still much more accurate than that of those who did not experience the events.
  • Both groups provided low numbers of incorrect responses for most questions. They showed little evidence of the wildly inaccurate recall you would expect from Ehrman’s discussion of the evidence.

The study also found another interesting difference. People connected to the Danish Resistance Movement had more accurate memories of events than those who were not connected to the Resistance Movement. This further highlights how being closely connected to events is important for accurate memory.

Therefore, we find what’s called a dose-response effect. Those who are least connected to the events (those who were not yet alive in WWII) had the least accurate memory. People who directly experienced events remembered them more accurately. Those with the most accurate memory were those with more at stake (i.e. those with connections to the Danish resistance movement).

GRADE rating: Moderate certainty.

Study limitations: as in the wider literature, the studies are not of particularly high quality, so it is appropriate to raise concerns for this domain.

Inconsistency: the evidence consistently shows those who directly experience an incident have substantially more accurate memories than those who do not. Therefore, no substantial problems with inconsistency have been identified.

Imprecision: the studies discussed are large (except Neisser (1996)). In my judgment, there is sufficient data to draw conclusions.

Indirectness: we can assume Ehrman considers the Challenger study to apply to our question, otherwise he wouldn’t have cited it. To increase our certainty about the applicability of the evidence, we have to judge whether studies of memory in people who directly experienced and were heavily involved in events are more applicable to our question than studies of memory for events people have heard about on the news. This appears straightforwardly to be the case.

Publication bias: as before we don’t have sufficient data to assess this.

Looking across the five domains, only study limitations have led to concerns, so my rating of the evidence is Moderate.

Issue 2: Do people commonly mistake events they imagined for what happened?

Ehrman quotes a study illustrating that we are highly susceptible to developing false memories. Based on his description, I think he is referring to Seamon et al (2006). 40 undergraduate students were taken on a campus tour and asked either to perform many familiar (e.g. check the Pepsi machine for change) or bizarre (e.g. propose marriage to a Pepsi machine) actions, or to imagine performing these actions. Here is Ehrman’s summary:

“It turns out that 2 weeks later after they interview the students. When they ask students ‘do you remember doing this? Do you remember going down on one knee and proposing marriage to the Pepsi machine?’ If the student had simply imagined doing it two weeks later they remember actually doing it.”

Bart Ehrman (from around 19:00 on the video)

Let’s go back to the study to see if this reflects the data:

The extent to which the study can identify ‘false memory’ is unknown

Memory researchers would technically categorise the study cited by Ehrman as a study of ‘imagination inflation’. Brewin and Andrews (2017) point out that almost all studies of imagination inflation (including the study cited by Ehrman) assess belief about memory rather than recollective experience.

Although belief about memory is a component of memory, evidence of false memory requires assessing a person’s ‘recollective experience’ of the events and their confidence in the validity of this experience. Therefore “the extent to which these procedures produce full autobiographical memories is unknown” p18

The data suggests it was rare for participants to believe imagined events happened.

Listening to Ehrman’s summary, I was expecting to see very high rates of ‘false memory’.

However, the data shows the opposite. On average, 7-12% of all imagined events were incorrectly believed to have been performed by participants in the study. So, on average, the other 88-93% of imagined events were correctly remembered as being imagined.

Therefore Ehrman’s summary does not reflect what the study found. Yes, of course, we should be aware that our memories are fallible and we can sometimes recall events incorrectly. What the study doesn’t show is that this is a common problem.

The data shows overall a high level of accuracy for recalling events

This is surprising as people were asked to recall from 72 actions/imaginations over 2 different sessions in an experiment designed to identify memory distortions (they probably did better than I would have!):

  • On average 78-91% correctly recalled performing a ‘bizarre’ action – this was a little lower for ‘familiar’ actions (on average 64-77%)
  • On average only 2% claimed to have performed an action that they had neither performed nor imagined

Can we generalise these results to everyday or historical events?

There are several aspects of the study which limit our confidence in applying these data to the reliability of the Gospels. Here are a few:

  • This study is specifically designed to confuse imagined and actual events in an inevitably artificial manner. This is a limitation for many ‘laboratory’ studies of human participants. That’s not to devalue the research. It’s just to point out that it is far from obvious whether these findings generalise to how people in their real lives get mixed up between imagined and actual events. The nature of the study suggests that, even if the results are valid, they may overestimate the rates of confusing imagined and real events.
  • The very low rates of ‘false memory’ (believing that imagined events happened) weaken the validity of the findings. Human studies are extremely vulnerable to the expectations both of the investigators in the study and the participants whose memories are being tested. For example, some studies have shown that ‘expectation effects’ can inflate estimates by approximately 30%. Therefore, it is difficult to disentangle the real effects they are seeking to measure from the inevitable ‘noise’ from factors like expectation effects that happen in human experiments.

What does the wider literature of similar studies show?

It’s good to again contextualize this study with other similar studies investigating false memory. A recent systematic review has looked at a large number of studies in this area, specifically focusing on attempts to elicit false childhood memories in adults. This is particularly interesting as we would expect to see the highest rates of false memory for these types of memories.

Brewin and Andrews (2017) included 16 studies on imagination inflation (where studies assess whether imagining an event can increase the belief that the event happened), 15 studies on false feedback (where the impact of providing false feedback to study participants is measured) and 20 studies on memory implantation (where psychologists try to implant memories of participants’ childhoods from scratch).

Brewin and Andrews, looking at a wider range of studies, come to conclusions similar to my critique of the study cited by Ehrman above:

  • “There are sufficient grounds to conclude that a (probably small) minority of people might develop false memories of childhood events with these characteristics and that any such memories might contain a mixture of true and false elements.” p20
  • “…we believe it cannot be concluded that false memories of childhood events possessing these characteristics are common, that they are easy to suggest or implant or that the majority of individuals are susceptible to them.” p20

Summary and Conclusions

Points of agreement and disagreement

We’ve come a long way, so it’s good to sum up by looking at where I agree and where I disagree with Ehrman:

Points of agreement

We agree that human memory is fallible:

  • For some events, our memories can be unreliable. We can’t take for granted the accuracy of memory for all events.
  • We can develop false memories including through the manipulation of others or mistakenly mixing up events that we had imagined with what happened.

Points of disagreement

We disagree, it seems to me quite profoundly, on the extent of that fallibility:

  • Ehrman dismisses evidence of greater accuracy in memory for those most closely connected to events. I see no reason why this evidence should be dismissed, and it appears to be of greater quality and relevance for understanding the memories of the disciples than the other studies he cites.
  • Ehrman appears to think it is common for us to mix up the memory of imagined actions with what happened. I’ve shown this is relatively rare both in the study he cited and a much wider body of studies.

Where I disagree with his presentation of psychological data

I’ve highlighted three main areas where I have problems with how Ehrman has interpreted and presented the psychological literature on memory. I think he has:

  • Selectively cited studies that fit what he wants to say without reflecting the variability of findings in similar studies.
  • Ignored studies that inconveniently do not reflect the narrative he wants to tell.
  • Interpreted studies to fit a narrative that isn’t reflected in the data.



