Data curation

Curation principles

Data must be captured, standardized, organized, and made accessible to the scientific community in order to have a significant and lasting impact. The role of ViruSurf is to provide up-to-date, accurate, and accessible information and, through this critical activity, to facilitate scientific discovery in virology.

Metadata curation

Our first curation contribution is to provide a single schema (see Viral Conceptual Model - VCM) shared by the different data sources. Ad-hoc importer modules have been built to load GenBank, RefSeq, COG-UK, and GISAID data; the code is available here. Each source makes different terminological choices when describing its metadata. We have surveyed the terms used in many sources to come up with appropriate reconciling solutions. All import results have been browsed and manually checked by the development team.

Note that for the GISAID-specific interface of ViruSurf we use attribute names as specified in the GISAID EpiCoV database. When the reconciling terminology from VCM is different, it is reported within parentheses in second position.

Specific curation efforts have been dedicated to:
  • Location information. GISAID provides the Location; we split the field on the separator "/" and save the first three levels into the GeoGroup, the Country, and the Region. GenBank always provides the Country and often the Region, while the GeoGroup is added from a maintained continent-country map. In COG-UK we retrieve the relevant location information and complement the Region and GeoGroup.
  • Collection date. For GISAID, the date is reported as communicated by the GISAID team. For other sources, to make sequences more easily searchable through date filters, when the specific day is not available we typically add 01 (the first day of the month); when the specific month is also not available, we use a default date (01 day, 01 month). If the date is missing altogether, we check the referenced publication and use the information provided there.
  • Submission date. GISAID provides the field. In GenBank it is retrieved from the JOURNAL field for SARS-CoV-2 entries, and at times from the PATENT field for SARS-CoV entries. In COG-UK we set it to null. The same curation rules apply as for the collection date.
  • Virus taxon name, taxon ID, and species are critical fields as they depend on the information retrieved through the authoritative NCBI Taxonomy. We make a direct call to Entrez APIs:

    Entrez.efetch(db="taxonomy", id=taxon_id, rettype=None, retmode="xml")

    However, some viruses are described as "species" (such as SARS-CoV), while others are described as "sub-species" (such as SARS-CoV-2). This may result in confusion, e.g., when comparing SARS-CoV vs. SARS-CoV-2.
    We have thus chosen a comprehensive approach: the field taxon name refers to the most specific term used to define the type of virus under examination, while the field taxon ID is the related numerical identifier within NCBI Taxonomy. Species, instead, indicates the direct parent (in the taxonomy graph) of the taxon name (when this is not a "species" according to the taxonomy) or the taxon name itself (when it is a "species").
  • As to the host organism information, we also manually add the Host taxon ID (which complements the Host taxon name); in some cases we do not find an exact match with NCBI Taxonomy entries, so we apply our own translation, e.g., "canine" becomes "Canis lupus familiaris".
    In some rare cases, sequences come with gender and age information, which is parsed directly from the "host"-related string (e.g., host="Homo sapiens; male; age 65").
  • From COG-UK and GISAID we only import SARS-CoV-2 data, for which the reference sequence is clear. In GenBank, not every virus species has a clearly specified reference sequence. We thus choose the reference sequence after cross-checking several research papers to ascertain which sequence is typically used as the reference for variant calling.
  • It is not always clear whether a sequence should be labeled complete or partial. For GISAID this information is imported. For GenBank, we set the "complete" value when the keyword "complete genome" appears in the DEFINITION field, and the "partial" value when we find "partial CDS", "complete cds" or "partial genome". Otherwise (and always in COG-UK), we compute this information according to the following rule: if the sequence length is >= 95% of the reference length then Is Complete = True, else Is Complete = False.
  • Strand, as well as Molecule Type, Is Single Stranded and Is Positive Stranded are curated manually.
  • GC% and N% are provided by GISAID for its data, while for the other sources they have been computed to give users additional information on the quality of the sequence.
  • The information on sequencing technology, assembly method and coverage is only available from GenBank. We look for a specific pattern in the COMMENT field (i.e., ##Assembly-Data-START## < content > ##Assembly-Data-END##).
    For the Coverage field, a series of curation steps was necessary, as the values gathered from GenBank were very heterogeneous in notation:
    • remove strings (possibly within parentheses) that appear to the left or right of the numbers (including x, X, reads/nt, >, <, ...)
    • treat commas or dots between digits in two different ways:
      • if they have three digits on their right: remove them
      • if they have fewer than three digits on their right (so 1 or 2), treat those digits as decimals and round to the nearest integer (e.g., 112593.3 becomes 112593, whilst 70.8 becomes 71)
    These rules were applied after carefully checking many examples in the original research papers to ascertain that our homogenization functions were consistent with the original semantics.
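
The Coverage rules above can be illustrated with a minimal sketch. The function name normalize_coverage, the choice to keep only the first number found in the string, and the half-up rounding of ties are assumptions made for this example; they are not taken from the ViruSurf code.

```python
import re
from decimal import Decimal, ROUND_HALF_UP


def normalize_coverage(raw):
    """Return the coverage as an integer, or None if no number is found.

    Hypothetical helper: a sketch of the rules listed above, not the
    actual ViruSurf implementation.
    """
    # Keep only the first run of digits, commas, and dots; this drops
    # surrounding text such as "x", "X", "reads/nt", ">", "<", parentheses.
    match = re.search(r"\d[\d.,]*", raw)
    if match is None:
        return None
    token = match.group(0).rstrip(".,")
    # A separator followed by exactly three digits is a thousands
    # separator: remove it.
    token = re.sub(r"[.,](?=\d{3}(?!\d))", "", token)
    # Any remaining separator is treated as a decimal mark.
    token = token.replace(",", ".")
    if token.count(".") > 1:
        return None  # notation not covered by the rules
    # Round to the nearest integer (ties rounded up: an assumption).
    return int(Decimal(token).quantize(Decimal("1"), rounding=ROUND_HALF_UP))


print(normalize_coverage("112593.3"))  # the example from the text
print(normalize_coverage("70.8x"))
```

Values whose notation falls outside the listed rules (for instance, several decimal marks) are returned as None here so that they can be inspected manually.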


As an additional curation step, COG-UK records are integrated with five additional metadata fields (Originating lab, Submission date, Submission Lab, Isolation source, IsComplete) when they correspond to GISAID records that hold such information.
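
Three of the metadata rules listed above (splitting the GISAID Location, completing partial collection dates, and the 95% completeness rule) are simple enough to sketch in a few lines. The function names and the assumption that dates arrive as YYYY, YYYY-MM, or YYYY-MM-DD strings are hypothetical; the sources may use other date notations that would need parsing first.

```python
def split_location(location):
    """Split a "/"-separated location into (GeoGroup, Country, Region)."""
    parts = [part.strip() or None for part in location.split("/")]
    parts += [None] * (3 - len(parts))  # pad when fewer than three levels
    geo_group, country, region = parts[:3]
    return geo_group, country, region


def pad_partial_date(date_text):
    """Complete a YYYY or YYYY-MM date with the default "01" fields.

    Assumes an ISO-like input; returns None for anything else.
    """
    fields = date_text.strip().split("-")
    if not 1 <= len(fields) <= 3 or not all(f.isdigit() for f in fields):
        return None
    fields += ["01"] * (3 - len(fields))  # missing month and/or day
    year, month, day = fields
    return f"{int(year):04d}-{int(month):02d}-{int(day):02d}"


def is_complete(sequence_length, reference_length):
    """Apply the rule: complete if length >= 95% of the reference length."""
    # Integer arithmetic avoids floating-point edge cases at the threshold.
    return sequence_length * 100 >= 95 * reference_length


print(split_location("Europe / Italy / Lombardy / Milan"))
print(pad_partial_date("2020-03"))
print(is_complete(29000, 29903))
```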

Data curation

Data curation is not performed for instances of GISAID, as we do not import sequences from this source. Amino acid level changes are imported from periodic exports from GISAID and simply parsed into our database. For other sources, we engineered our pipelines to provide a single annotation procedure that seamlessly processes sequences that come with existing annotations as well as sequences that do not.
This process is critical to supplying users with comparable information.

For the portion of already annotated sequences (i.e., from GenBank), our in-house algorithm compares its results with the existing annotations and, when they differ, these cases are manually re-checked to ensure correctness. We extract: structural annotations (filling the Annotation table), nucleotide and amino acid sequences for each annotated segment, nucleotide variants and their impact, amino acid variants for the proteins, and other information such as the percentage of specific nucleotide bases.

To compute this information, for each virus we manually select a reference sequence and a set of annotations, comprising coordinates for coding and structural regions, as well as the amino acid sequence of each protein. Usually, such data are taken from the RefSeq entry for the given virus (e.g., NC_045512 for SARS-CoV-2), yet our import process is not bound to RefSeq, and any source that can provide such information may be used.

For each imported sequence, the pipeline starts by computing the optimal global alignment to the reference by means of the dynamic programming Needleman-Wunsch (NW) algorithm. The time and space complexity of NW is quadratic in the length of the aligned sequences, which often hinders its adoption in real scenarios; however, in the virus case the sequences are relatively small, and we preferred NW over faster heuristic techniques to ensure that the optimal alignment is found for every sequence. Moreover, we configure the algorithm to use an affine gap penalty, so as to favor longer gaps, which are very frequent at the ends of sequences.
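
To make the alignment step concrete, here is a minimal pure-Python sketch of global alignment with an affine gap penalty (the three-matrix formulation usually attributed to Gotoh). It is an illustration only: the function name align_affine and the scoring values are assumptions, and the actual pipeline may rely on a different implementation and parameters.

```python
NEG_INF = float("-inf")


def align_affine(ref, seq, match=1, mismatch=-1, gap_open=-5, gap_extend=-1):
    """Global alignment with affine gaps; returns (score, ref_aln, seq_aln).

    A gap of length k costs gap_open + (k - 1) * gap_extend.
    Scoring values are illustrative assumptions.
    """
    n, m = len(ref), len(seq)
    # M: last column is a (mis)match; X: gap in seq; Y: gap in ref.
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = gap_open + (i - 1) * gap_extend
    for j in range(1, m + 1):
        Y[0][j] = gap_open + (j - 1) * gap_extend
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if ref[i - 1] == seq[j - 1] else mismatch
            M[i][j] = sub + max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            X[i][j] = max(M[i - 1][j] + gap_open, X[i - 1][j] + gap_extend,
                          Y[i - 1][j] + gap_open)
            Y[i][j] = max(M[i][j - 1] + gap_open, Y[i][j - 1] + gap_extend,
                          X[i][j - 1] + gap_open)
    # Traceback from the best-scoring layer in the bottom-right cell.
    layers = {"M": M, "X": X, "Y": Y}
    state = max(layers, key=lambda s: layers[s][n][m])
    score = layers[state][n][m]
    i, j = n, m
    ref_aln, seq_aln = [], []
    while i > 0 or j > 0:
        if state == "M":
            ref_aln.append(ref[i - 1])
            seq_aln.append(seq[j - 1])
            i, j = i - 1, j - 1
            state = max(layers, key=lambda s: layers[s][i][j])
        elif state == "X":
            ref_aln.append(ref[i - 1])
            seq_aln.append("-")
            current = X[i][j]
            i -= 1
            if current == X[i][j] + gap_extend:
                state = "X"
            elif current == M[i][j] + gap_open:
                state = "M"
            else:
                state = "Y"
        else:
            ref_aln.append("-")
            seq_aln.append(seq[j - 1])
            current = Y[i][j]
            j -= 1
            if current == Y[i][j] + gap_extend:
                state = "Y"
            elif current == M[i][j] + gap_open:
                state = "M"
            else:
                state = "X"
    return score, "".join(reversed(ref_aln)), "".join(reversed(seq_aln))


print(align_affine("AAAATTTTGGGG", "AAAAGGGG"))
```

With an opening penalty larger than the extension penalty, a single long gap scores better than several short ones, which is the behavior described above. The three score tables are kept in full, so memory grows with the product of the two lengths.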

Once the alignment is computed, all the differences (variants) with the reference sequence are collected in the form of nucleotide substitutions, insertions or deletions. Using the SnpEff tool (see SnpEff documentation) we annotate each variant and predict its impact on the coding regions; indeed, a variant may, for example, be irrelevant (e.g., when the mutated codon encodes the same amino acid as the original codon), produce small changes, or be deleterious. Based on the alignment result, the sub-sequences corresponding to the reference annotations are identified within the input sequence.
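
As an illustration of how variants can be read off a pairwise alignment, the sketch below walks the two aligned strings column by column. The function name collect_variants, the tuple layout, and the position conventions (1-based reference coordinates, insertions reported at the reference position they follow) are assumptions for this example; the impact prediction performed with SnpEff is not covered.

```python
def collect_variants(aligned_ref, aligned_seq):
    """List variants as (type, ref_position, ref_allele, alt_allele).

    Both inputs are aligned strings of equal length using "-" for gaps.
    Adjacent inserted or deleted bases are merged into one variant.
    """
    if len(aligned_ref) != len(aligned_seq):
        raise ValueError("aligned sequences must have the same length")
    variants = []
    ref_pos = 0  # number of reference bases consumed so far
    for ref_base, seq_base in zip(aligned_ref, aligned_seq):
        if ref_base != "-":
            ref_pos += 1
        if ref_base == seq_base:
            continue
        if ref_base == "-":
            new = ("INS", ref_pos, "", seq_base)
        elif seq_base == "-":
            new = ("DEL", ref_pos, ref_base, "")
        else:
            new = ("SUB", ref_pos, ref_base, seq_base)
        if variants:
            kind, start, ref_allele, alt_allele = variants[-1]
            extends_ins = kind == new[0] == "INS" and start == ref_pos
            extends_del = (kind == new[0] == "DEL"
                           and start + len(ref_allele) == ref_pos)
            if extends_ins or extends_del:
                # Grow the previous insertion or deletion by one base.
                variants[-1] = (kind, start,
                                ref_allele + new[2], alt_allele + new[3])
                continue
        variants.append(new)
    return variants


for variant in collect_variants("ACGT-ACGT", "ACCTTA--T"):
    print(variant)
```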

Coding regions are translated into their equivalent amino acid sequences. The translation mechanism takes into consideration annotated ribosomal frameshift events (e.g., within the ORF1ab gene of SARS-CoV-2) to compute the correct amino acid sequence. When the translation fails (e.g., because the nucleotide sequence retrieved from the alignment is empty or its length is not a multiple of 3), we ignore the corresponding amino acid product; such events are mainly due to incompleteness and poor quality of the input sequence, and further investigating these sequences for amino acid variants would lead to erroneous information. When, instead, an aligned codon contains any IUPAC character ambiguously representing a set of bases (https://genome.ucsc.edu/goldenPath/help/iupac.html), it is translated into the X (unknown) amino acid, which automatically becomes a variant. Notice that queries selecting known amino acids work correctly regardless of unknown ones (which in most cases will not be of interest). Otherwise, the translated amino acid sequence is further aligned, again using NW, with the corresponding amino acid sequence annotated in the reference, and any resulting amino acid variants are inferred.
The sequences provided by COG-UK are already aligned to the reference genome. However, our pipeline first reverts the alignment (by simply removing the inserted gaps) and then recomputes it using the same reference sequence and alignment algorithm used for the other sources. This choice was made to ensure uniform alignments across the different data sources.
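
The translation rules and the COG-UK gap removal described above can be sketched as follows. The function names are hypothetical, the standard genetic code is assumed, and the handling of ribosomal frameshifts is deliberately left out of this example.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def remove_gaps(aligned_sequence):
    """Revert an alignment by dropping the inserted "-" characters."""
    return aligned_sequence.replace("-", "")


def translate(nucleotides):
    """Translate a coding sequence; return None when translation fails.

    Translation fails when the sequence is empty or its length is not a
    multiple of 3. Codons containing IUPAC ambiguity characters (or any
    other non-ACGT symbol) become "X".
    """
    nucleotides = nucleotides.upper()
    if not nucleotides or len(nucleotides) % 3 != 0:
        return None
    return "".join(
        CODON_TABLE.get(nucleotides[i:i + 3], "X")
        for i in range(0, len(nucleotides), 3)
    )


print(translate("ATGGCNTAA"))
print(translate("ATGGC"))
print(remove_gaps("AC--GT-A"))
```

In this sketch stop codons are kept as "*" in the output; whether they are retained or trimmed in the stored protein sequences is not stated in the text.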

Note that for protein names we use the typical terminology of GISAID, while we put within parentheses in second position the (possibly) alternative terminology adopted by NCBI GenBank-related resources.

Overlap detection and management

ViruSurf solves the problem of record redundancy among different databases by using external reference IDs that are available from the original sources, or by computing in-house the matches between strain names and sequence lengths. We store this information and allow queries on just "GISAID only" sequences, i.e., those that are not present in GenBank, COG-UK or NMDC.
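
The in-house matching on strain name and sequence length can be pictured with the small sketch below. The record layout (dictionaries with "accession", "strain", and "length" keys), the function names, and the exact-match comparison are assumptions for illustration; matching through external reference IDs is not shown.

```python
from collections import defaultdict


def find_overlaps(gisaid_records, other_records):
    """Map each GISAID accession to the matching accessions elsewhere.

    Two records match when strain name and sequence length are equal.
    """
    index = defaultdict(list)
    for record in other_records:
        index[(record["strain"], record["length"])].append(record["accession"])
    return {
        record["accession"]: index[(record["strain"], record["length"])]
        for record in gisaid_records
        if (record["strain"], record["length"]) in index
    }


def gisaid_only(gisaid_records, other_records):
    """Return the GISAID accessions with no match in the other sources."""
    overlaps = find_overlaps(gisaid_records, other_records)
    return [r["accession"] for r in gisaid_records
            if r["accession"] not in overlaps]


# Made-up records, for illustration only.
gisaid = [
    {"accession": "G1", "strain": "strain-A", "length": 29903},
    {"accession": "G2", "strain": "strain-B", "length": 29870},
]
others = [{"accession": "N1", "strain": "strain-A", "length": 29903}]
print(find_overlaps(gisaid, others))
print(gisaid_only(gisaid, others))
```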

Case/control computation

The "Show control" switch displays the sequences of the control group, defined as those sequences selected by the Metadata search filters for which the Variant filters are not satisfied. Note that for some sequences the amino acid sequences could not be computed (e.g., because the nucleotide sequence retrieved from the alignment is empty or its length is not a multiple of 3). The "controls" exclude such sequences.
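
The case/control definition can be summarized in a short sketch. The record layout (an "aa_variants" entry holding a set of variants, or None when the amino acid sequences could not be computed) and the two filter callables are hypothetical. Here the sequences without computed amino acid sequences are left out of both groups, which is an assumption: the text only states that they are excluded from the controls.

```python
def split_case_control(sequences, metadata_filter, variant_filter):
    """Return (cases, controls) among the sequences passing the metadata filter."""
    cases, controls = [], []
    for record in sequences:
        if not metadata_filter(record):
            continue  # outside the population selected by the metadata search
        variants = record["aa_variants"]
        if variants is None:
            continue  # amino acid sequences not computed: not a control
        if variant_filter(variants):
            cases.append(record)
        else:
            controls.append(record)
    return cases, controls


# Made-up records, for illustration only.
records = [
    {"id": "s1", "country": "Italy", "aa_variants": {"S:D614G"}},
    {"id": "s2", "country": "Italy", "aa_variants": set()},
    {"id": "s3", "country": "Italy", "aa_variants": None},
    {"id": "s4", "country": "Spain", "aa_variants": {"S:D614G"}},
]
cases, controls = split_case_control(
    records,
    metadata_filter=lambda r: r["country"] == "Italy",
    variant_filter=lambda v: "S:D614G" in v,
)
print([r["id"] for r in cases], [r["id"] for r in controls])
```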

Reference sequences