Altered datasets raise more questions about reliability of key studies on coronavirus origins

Revisions to genomic datasets associated with four key studies on coronavirus origins add further questions about the reliability of these studies, which provide foundational support for the hypothesis that SARS-CoV-2 originated in wildlife. The studies, Peng Zhou et al., Hong Zhou et al., Lam et al., and Xiao et al., discovered SARS-CoV-2-related coronaviruses in horseshoe bats and Malayan pangolins.

The studies’ authors deposited DNA sequence data called sequence reads, which they used to assemble bat- and pangolin-coronavirus genomes, in the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA). NCBI established the public database to assist independent verification of genomic analyses based on high-throughput sequencing technologies.

U.S. Right to Know obtained documents through a public records request that show revisions to these studies’ SRA data months after they were published. These revisions are odd because they were made after publication and without any rationale, explanation or validation.

For example, Peng Zhou et al. and Lam et al. updated their SRA data on the same two dates. The documents don’t explain why they altered their data, only that some changes were made. Xiao et al. made numerous changes to their SRA data, including the deletion of two datasets on March 10, the addition of a new dataset on June 19, a November 8 replacement of data first released on October 30, and a further data change on November 13, two days after Nature added an Editor’s “note of concern” about the study. Hong Zhou et al. have yet to share the full SRA dataset that would enable independent verification. Although journals like Nature require authors to make all data “promptly available” at the time of publication, SRA data can be released after publication. It is unusual, however, to make such changes months after publication.

These unusual alterations of SRA data do not automatically make the four studies and their associated datasets unreliable. However, the delays, gaps and changes in SRA data have hampered independent assembly and verification of the published genome sequences, and add to questions and concerns about the validity of the four studies, such as:

What were the exact post-publication revisions to the SRA data? Why were they made? How did they affect the associated genomic analyses and results?
Were these SRA revisions independently validated? If so, how? The NCBI’s only validation criterion for publishing an SRA BioProject, beyond basic information such as “organism name,” is that it cannot be a duplicate.

For more information

The National Center for Biotechnology Information (NCBI) documents can be found here: NCBI emails (63 pages)

U.S. Right to Know is posting documents from our public records requests for our biohazards investigation. See: FOI documents on origins of SARS-CoV-2, hazards of gain-of-function research and biosafety labs.

Background page on U.S. Right to Know’s investigation into the origins of SARS-CoV-2.

Written by Sainath Suryanarayanan
