Altered datasets raise more questions about reliability of key studies on coronavirus origins


Revisions to genomic datasets associated with four key studies on coronavirus origins raise further questions about the reliability of these studies, which provide foundational support for the hypothesis that SARS-CoV-2 originated in wildlife. The studies, Peng Zhou et al., Hong Zhou et al., Lam et al., and Xiao et al., discovered SARS-CoV-2-related coronaviruses in horseshoe bats and Malayan pangolins.

The studies’ authors deposited DNA sequence data called sequence reads, which they used to assemble bat- and pangolin-coronavirus genomes, in the National Center for Biotechnology Information (NCBI) sequence read archive (SRA). NCBI established the public database to assist independent verification of genomic analyses based on high-throughput sequencing technologies.

U.S. Right to Know obtained documents through a public records request that show revisions to these studies’ SRA data months after they were published. These revisions are odd because they occurred after publication and without any stated rationale, explanation or validation.

For example, Peng Zhou et al. and Lam et al. updated their SRA data on the same two dates. The documents don’t explain why they altered their data, only that some changes were made. Xiao et al. made numerous changes to their SRA data, including the deletion of two datasets on March 10, the addition of a new dataset on June 19, a November 8 replacement of data first released on October 30, and a further data change on November 13, two days after Nature added an Editor’s “note of concern” about the study. Hong Zhou et al. have yet to share the full SRA dataset that would enable independent verification. Journals like Nature require authors to make all data “promptly available” at the time of publication, and although SRA data can be released after publication, it is unusual to make such changes months afterward.

These unusual alterations of SRA data do not automatically make the four studies and their associated datasets unreliable. However, the delays, gaps and changes in SRA data have hampered independent assembly and verification of the published genome sequences, and add to questions and concerns about the validity of the four studies, such as:

  1. What were the exact post-publication revisions to the SRA data? Why were they made? How did they affect the associated genomic analyses and results?
  2. Were these SRA revisions independently validated? If so, how? The NCBI’s only validation criterion for publishing an SRA BioProject, beyond basic information such as “organism name,” is that it cannot be a duplicate.

For more information: 

The National Center for Biotechnology Information (NCBI) documents can be found here: NCBI emails (63 pages)

U.S. Right to Know is posting documents from our public records requests for our biohazards investigation. See: FOI documents on origins of SARS-CoV-2, hazards of gain-of-function research and biosafety labs.

Background page on U.S. Right to Know’s investigation into the origins of SARS-CoV-2.

Validity of key studies on origin of coronavirus in doubt; science journals investigating


By Carey Gillam

Since the outbreak of COVID-19 in the Chinese city of Wuhan in December 2019, scientists have searched for clues about what led to the emergence of its causative agent, the novel coronavirus SARS-CoV-2. Uncovering the source of SARS-CoV-2 could be crucial for preventing future outbreaks.

A series of four high-profile studies published earlier this year provided scientific credence to the hypothesis that SARS-CoV-2 originated in bats and then jumped to humans through the pangolin, a scaly, ant-eating mammal that is among the world’s most trafficked wild animals. While that specific theory involving pangolins has been largely discounted, the four studies known as the “pangolin papers” continue to provide support for the notion that coronaviruses closely related to SARS-CoV-2 circulate in the wild, meaning the SARS-CoV-2 that caused COVID-19 probably comes from a wild animal source.

The focus on a wild animal source, the “zoonotic” theory, has become a critical element in global discussion about the virus, directing public attention away from the possibility that the virus may have originated inside a Chinese governmental laboratory – the Wuhan Institute of Virology.

U.S. Right to Know (USRTK) has learned, however, that two of the four papers that make up the foundation for the zoonotic theory appear to be flawed, and that the editors at the journals in which the papers were published – PLoS Pathogens and Nature – are investigating the core data behind the studies and how the data was analyzed. The other two similarly appear to suffer from flaws.

The problems with the research papers raise “serious questions and concerns” about the validity of the zoonotic theory overall, according to Dr. Sainath Suryanarayanan, a biologist and sociologist of science, and USRTK staff scientist.  The studies lack sufficiently reliable data, independently verifiable data sets and a transparent peer review and editorial process, according to Dr. Suryanarayanan. 

See his emails with senior authors of the papers and journal editors, and analysis: Nature and PLoS Pathogens probe scientific veracity of key studies linking pangolin coronaviruses to origin of SARS-CoV-2.

In December, Chinese governmental authorities first promoted the idea that the causal agent of COVID-19 in humans came from a wild animal. Chinese government-supported scientists then backed that theory in four separate studies submitted to the journals between February 7 and 18.

The World Health Organization’s China Joint Mission Team investigating the emergence and spread of COVID-19 in China stated in February: “Since the COVID-19 virus has a genome identity of 96% to a bat SARS-like coronavirus and 86%-92% to a pangolin SARS-like coronavirus, an animal source for COVID-19 is highly likely.”

The Chinese-initiated focus on a wild animal source helped chill calls for an investigation into the Wuhan Institute of Virology, where animal coronaviruses have long been stored and genetically manipulated. Instead, resources and efforts of the international scientific and policymaking community have been funneled toward understanding the factors shaping contact between people and wildlife. 

The four papers in question are Liu et al., Xiao et al., Lam et al. and Zhang et al. The two that are currently being investigated by the journal editors are Liu et al. and Xiao et al. In communications with the authors and journal editors of those two papers, USRTK has learned of serious problems with the publication of those studies, including the following:

  • Liu et al. did not publish or share (upon being asked) raw and/or missing data that would allow experts to independently verify their genomic analyses.
  • Editors at both Nature and PLoS Pathogens, as well as Professor Stanley Perlman, the editor of Liu et al., have acknowledged in email communications that they are aware of serious issues with these papers and that the journals are investigating them. Yet, they have made no public disclosure of the potential problems with the papers.  

The silence of the journals regarding their ongoing investigations means that wider communities of scientists, policymakers and the public impacted by COVID-19 are unaware of the problems associated with the research papers, said Dr. Suryanarayanan. 

“We believe that these issues are important, since they may shape how institutions respond to a catastrophic pandemic that has radically affected lives and livelihoods worldwide,” he said.

Links to these emails can be found here: 

In July 2020, U.S. Right to Know began submitting public records requests in pursuit of data from public institutions in an effort to discover what is known about the origins of the novel coronavirus SARS-CoV-2, which causes the disease COVID-19. Since the start of the outbreak in Wuhan, SARS-CoV-2 has killed over a million people, while sickening millions more in a global pandemic that continues to unfold.

On Nov. 5, U.S. Right to Know filed a lawsuit against the National Institutes of Health (NIH) for violating provisions of the Freedom of Information Act. The lawsuit, filed in U.S. District Court in Washington, D.C., seeks correspondence with or about organizations such as the Wuhan Institute of Virology and the Wuhan Center for Disease Control and Prevention, as well as the EcoHealth Alliance, which partnered with and funded the Wuhan Institute of Virology.

U.S. Right to Know is a nonprofit investigative research group focused on promoting transparency for public health. You can support our research and reporting by donating here. 

Nature and PLoS Pathogens probe scientific veracity of key studies linking pangolin coronaviruses to origin of SARS-CoV-2



By Sainath Suryanarayanan, PhD 

Here, we provide our emails with senior authors of Liu et al. and Xiao et al., and the editors of PLoS Pathogens and Nature. We also present an in-depth discussion of the questions and concerns raised by these emails, which put in doubt the validity of these key studies on the origin of the novel coronavirus SARS-CoV-2 that causes COVID-19. See our reporting on these emails: Validity of key studies on origin of coronavirus in doubt; science journals investigating (11.9.20).


Email communications with Dr. Jinping Chen, senior author of Liu et al:


Dr. Jinping Chen’s emails raise a number of concerns and questions: 

1– Liu et al. (2020) assembled their published pangolin coronavirus genome sequence based on coronaviruses sampled from three pangolins: two samples from a smuggled batch in March 2019, and one sample from a different batch intercepted in July 2019. The National Center for Biotechnology Information (NCBI) database, where scientists are required to deposit sequence data to ensure independent verification and reproducibility of published results, contains the sequence read archive (SRA) data for the two March 2019 samples but is missing data for the July 2019 sample. When asked about this missing sample, which he identifies as F9, Dr. Jinping Chen stated: “The raw data of these three samples could be found under NCBI accession number PRJNA573298, and the BioSample ID were SAMN12809952, SAMN12809953, and SAMN12809954, moreover, individual (F9) from different batch was also positive, the raw data can be seen in NCBI SRA SUB 7661929, which will be released soon for we have another MS (under review)” (our emphasis).

It is concerning that Liu et al. have not published data corresponding to one of the three pangolin samples that they used to assemble their pangolin coronavirus genome sequence. Dr. Jinping Chen also did not share this data upon being asked. The norm in science is to publish and/or share all data that would allow others to independently verify and reproduce the results. How did PLoS Pathogens let Liu et al. evade publishing crucial sample data? Why is Dr. Jinping Chen not sharing data pertaining to this third pangolin sample? Why would Liu et al. want to release unpublished data pertaining to this third pangolin sample as part of another study that has been submitted to a different journal? The concern here is that scientists would misattribute the missing pangolin sample from Liu et al. to a different study, making it difficult for others to subsequently trace important details about this pangolin sample, such as the context in which it was collected.

2– Dr. Jinping Chen denied that Liu et al. have had any relationship with Xiao et al.’s (2020) Nature study. He wrote: “We submitted our PLOS Pathogens paper on Feb.14, 2020 before the Nature paper (the Reference 12 in our PLOS pathogens paper, they submitted on Feb.16, 2020 from their submit date in Nature), our PLOS pathogens paper explain that SARS-Cov-2 is not from pangolin coronavirus directly and pangolin not as intermediate host. We knew their work after their news briefing on Feb. 7, 2020, and we have different opinions with them, the other two papers (Viruses and Nature) have been listed in the PLOS Pathogen paper as reference papers (reference number 10 and 12), we are different research groups from Nature paper authors, and there is no relationship with each other, and we took samples with detail sample information from the Guangdong wildlife rescue center with helps from Jiejian Zou and Fanghui Hou as our co-authors and we don’t know where the samples of the Nature paper from.” (our emphases)

The following points raise doubts about Dr. Chen’s claims above: 

a– Liu et al. (2020), Xiao et al. (2020) and Liu et al. (2019) shared the following authors: Ping Liu and Jinping Chen were authors on the 2019 Viruses paper and the 2020 PLoS Pathogens paper, senior author Wu Chen on Xiao et al. (2020) was a co-author of the 2019 Viruses paper, and Jiejian Zou and Fanghui Hou were authors on both Xiao et al. and Liu et al.

b– Both manuscripts were deposited to the public preprint server bioRxiv on the same date: February 20, 2020. 

c– Xiao et al. “renamed pangolin samples first published by Liu et al. [2019] Viruses without citing their study as the original article that described these samples, and used the metagenomic data from these samples in their analysis” (Chan and Zhan). 

d– Liu et al.’s full pangolin coronavirus genome is 99.95% identical at the nucleotide level to the full pangolin coronavirus genome published by Xiao et al. How could Liu et al. have produced a whole genome that is 99.95% identical (only ~15 nucleotides difference) to Xiao et al. without sharing datasets and analyses?
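As a rough check of the figure in point d, the sketch below converts a percent identity into an approximate count of differing nucleotides. The genome length of 29,800 nucleotides is an assumption made for illustration (the text above does not state it), and the function name is hypothetical.

```python
# Assumed genome length in nucleotides; NOT stated in the article.
ASSUMED_GENOME_LENGTH_NT = 29_800


def estimated_nucleotide_differences(identity_percent: float, length_nt: int) -> float:
    """Approximate number of differing positions implied by a percent identity."""
    return (100.0 - identity_percent) / 100.0 * length_nt


estimated_differences = estimated_nucleotide_differences(99.95, ASSUMED_GENOME_LENGTH_NT)
print(f"{estimated_differences:.1f} differing nucleotides")  # about 15
```

Under that assumed length, 99.95% identity corresponds to roughly 15 differing nucleotides, consistent with the approximate figure quoted above.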

When different research groups independently arrive at similar conclusions about a given research question, the likelihood that those conclusions are true increases significantly. The concern here is that Liu et al. and Xiao et al. were not independently conducted studies as claimed by Dr. Chen. Was there any coordination between Liu et al. and Xiao et al. regarding their analysis and publications? If so, what was the extent and nature of that coordination?

3– Why did Liu et al. not make publicly available the raw amplicon sequencing data that they used to assemble their pangolin coronavirus genome? Without this raw data, others cannot independently verify and reproduce the pangolin coronavirus genome assembled by Liu et al. As mentioned earlier, the norm in science is to publish and/or share all data that would allow others to independently verify and reproduce the results. We asked Dr. Jinping Chen to share Liu et al.’s raw amplicon sequence data. He responded by sharing Liu et al.’s RT-PCR product sequence results, which are not the raw amplicon data used to assemble the pangolin coronavirus genome. Why is Dr. Jinping Chen reluctant to release the raw data that would allow others to independently verify Liu et al.’s analysis?

4– Liu et al. Viruses (2019) was published in October 2019 and its authors had deposited their pangolin coronavirus sequence read archive (SRA) data with NCBI on September 23, 2019, but waited until January 22, 2020 to make this data publicly accessible. Scientists typically release raw genomic sequence data on publicly accessible databases as soon as possible after the publication of their studies. This practice ensures that others can independently access, verify and utilize such data. Why did Liu et al. 2019 wait 4 months to make their SRA data publicly accessible? Dr. Jinping Chen chose not to directly answer this question in his response on November 9, 2020.
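The four-month gap described in point 4 can be checked with simple date arithmetic. The sketch below uses only the two dates given above; the variable names are ours.

```python
from datetime import date

# Dates as reported above for the Liu et al. (2019) SRA data.
deposited_with_ncbi = date(2019, 9, 23)
made_public = date(2020, 1, 22)

# Subtracting two dates yields a timedelta; .days is the elapsed day count.
gap_days = (made_public - deposited_with_ncbi).days
print(f"{gap_days} days between deposit and public release")
```

The result is 121 days, or about four months.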

We also contacted Dr. Stanley Perlman, the PLoS Pathogens editor of Liu et al., and this is what he had to say.

Notably, Dr. Perlman acknowledged that:

  • “PLoS Pathogens is investigating this paper in more detail” 
  • He “did not verify the veracity of the July 2019 sample during pre-publication peer review”
  • “[c]oncerns about similarity between the two studies [Liu et al. and Xiao et al.] came to light only after both studies had been published.”
  • He “did not see any amplicon data during peer review. The authors provided an accession number for the assembled genome…although after publication it came to light that the accession number listed in the article’s Data Availability Statement is incorrect. This error and questions around the raw contig sequencing data are currently being addressed as part of the post-publication case.”

When we contacted PLoS Pathogens with our concerns about Liu et al., we received the following response from the Senior Editor of the PLoS Publication Ethics team:

Emails from Xiao et al.

On October 28, the Chief Biological Sciences Editor of Nature replied (below) with the key phrase “we take these issues very seriously and will look into the matter you raise below very carefully.” 

On October 30, Xiao et al. finally publicly released their raw amplicon sequence data. However, as of the publication of this piece, the amplicon sequence data submitted by Xiao et al. is missing the actual raw data files that would allow others to assemble and verify their pangolin coronavirus genome sequence.

Important questions remain that need to be addressed: 

  1. Are the pangolin coronaviruses real? The caption for Figure 1e in Xiao et al. states: “Viral particles are seen in double-membrane vesicles in the transmission electron microscopy image taken from Vero E6 cell culture inoculated with supernatant of homogenized lung tissue from one pangolin, with morphology indicative of coronavirus.” If Xiao et al. isolated the pangolin coronavirus, would they share the isolated virus sample with researchers outside of China? This could go a long way toward verifying that this virus actually exists and came from pangolin tissue.
  2. How early in 2020, or even 2019, were Liu et al., Xiao et al., Lam et al. and Zhang et al. aware that they would be publishing results based on the same dataset?
    a. Was there any coordination considering that one was preprinted on February 18 and three were preprinted on February 20?
    b. Why did Liu et al. (2019) not make their sequence read archive data publicly accessible on the date they deposited it on NCBI’s database? Why did they wait until January 22, 2020 to make this pangolin coronavirus sequence data public?
    c. Before the Liu et al. 2019 Viruses data was released on NCBI on January 22, 2020, was this data accessible to other researchers in China? If so, what database was the pangolin coronavirus sequencing data stored on, who had access, and when was the data deposited and made accessible?
  3. Will the authors cooperate in an independent investigation to track the source of these pangolin samples to see if more SARS-CoV-2-like viruses can be found in the March to July 2019 batches of smuggled animals—which could exist as frozen samples or be still alive in the Guangdong Wildlife Rescue Center?
  4. And will the authors cooperate in an independent investigation to see if the smugglers (were they imprisoned? or fined and let go?) have SARS virus antibodies from regular exposure to these viruses?