Interactive Import of Epi Metadata

Epidemiological metadata can be typed in per sample via the sample overview panel or imported from an MS Excel or CSV file. Most of these data fields are modeled on the Global Microbial Identifier (GMI)/NCBI minimum epidemiological data requirements (e.g., FDA GenomeTrakr and CDC use these fields). These requirements focus on the three classical epidemiological dimensions, i.e., place, time, and ‘person’ information. In contrast to the procedure detail and statistics data fields, a SeqSphere+ user can add any number of epidemiological data fields to the default ones by defining a customized database schema for storing epidemiological data.

Automatic Import of Metadata

Metadata can be automatically imported from files when using the pipeline. Either CSV or SPEC files can be used for import.

CSV-files

Comma-separated values (CSV) files can be used to import metadata in a pipeline. Either a comma (,) or a semicolon (;) can be used as the separator.

The first row of a CSV file is used as the header. The field names described in the section Epi metadata fields below can be used to name the columns. In addition, if the label for a sample field is unique within a database schema, this label can also be used in the header row to define the sample field.

Tags can be set by using tg. as a prefix in the header, followed by the tag name, e.g., tg.MyTag. The value in that column indicates, for each sample, whether the tag should be set. Use "yes", "true", or "1" to set the tag for a sample.

Dates can be given as yyyy-MM-dd.
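The header and value conventions above can be illustrated with a short Python sketch. This is not SeqSphere+ code; the helper names (guess_separator, split_header, tag_is_set, parse_date) are hypothetical, and two behaviors are assumptions: that tag values are compared ignoring case and surrounding whitespace, and that the separator can be guessed by counting characters in the header row.

```python
from datetime import date, datetime

# Values that set a tag, as listed in the text above.
TRUTHY = {"yes", "true", "1"}


def guess_separator(header_line):
    """Guess whether ',' or ';' separates the columns (heuristic, an assumption)."""
    return ";" if header_line.count(";") > header_line.count(",") else ","


def split_header(header):
    """Split header names into field columns and tag names (tg. prefix removed)."""
    fields = [name for name in header if not name.startswith("tg.")]
    tags = [name[len("tg."):] for name in header if name.startswith("tg.")]
    return fields, tags


def tag_is_set(cell):
    """Return True if the cell requests the tag.

    Assumption: the comparison ignores case and surrounding whitespace.
    """
    return cell.strip().lower() in TRUTHY


def parse_date(text):
    """Parse a date written as yyyy-MM-dd."""
    return datetime.strptime(text.strip(), "%Y-%m-%d").date()
```

For example, split_header(["sample id", "Strain", "tg.MyTag"]) separates the two field columns from the tag name MyTag.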

One CSV-file for multiple pipeline samples

A CSV-file named metadata.csv can be added to the contig or read-file directory. The CSV-file can contain data for multiple samples. The first row is used as header.

The first header column named sample id (ignoring case) is used as the column that contains the sample IDs. Each data row is checked to see whether this column contains the sample ID of the currently processed sample. If it does, the 'Epi Metadata Fields' and 'Procedure Details and Statistics Fields' of the currently processed sample are filled with the data from this row.

If multiple rows exist that match the Sample Id, they are processed from top to bottom.

Example for a CSV-file for multiple pipeline samples:

sample id,Strain,pf.assembler,pf.assembler_version,tg.T2,tg.T3
DE9622,strain 1,MIRA,3,yes,no
DE9686,strain 2,MIRA,3,no,yes

Note that tag T2 will be set for DE9622 in this example, and DE9686 will be tagged with T3.

Values from CSV-files for multiple pipeline samples overwrite values from CSV-files for single samples if both are present.
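The row lookup described in this section can be sketched in Python as follows. This is only an illustration of the documented behavior, not the SeqSphere+ implementation; the function name fields_for_sample is hypothetical. The text states that matching rows are processed from top to bottom; that a later row overwrites values from an earlier one is an assumption of this sketch.

```python
import csv
import io


def fields_for_sample(csv_text, sample_id, delimiter=","):
    """Collect the column values of all rows that match sample_id."""
    reader = csv.reader(io.StringIO(csv_text), delimiter=delimiter)
    header = next(reader, None)
    if header is None:
        return {}
    # Use the first header column named "sample id", ignoring case.
    id_col = next(
        (i for i, name in enumerate(header) if name.strip().lower() == "sample id"),
        None,
    )
    if id_col is None:
        return {}
    merged = {}
    for row in reader:  # rows are visited from top to bottom
        if len(row) <= id_col or row[id_col].strip() != sample_id:
            continue
        for i, (name, value) in enumerate(zip(header, row)):
            if i != id_col:
                # Assumption: a later matching row overwrites earlier values.
                merged[name] = value
    return merged


EXAMPLE = """sample id,Strain,pf.assembler,pf.assembler_version,tg.T2,tg.T3
DE9622,strain 1,MIRA,3,yes,no
DE9686,strain 2,MIRA,3,no,yes
"""

values = fields_for_sample(EXAMPLE, "DE9686")
```

With the example file from this section, values holds the Strain, assembler, and tag columns of the DE9686 row only.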

CSV-files for single samples

A CSV-file for a single sample must have the same name as the input sequence file (e.g., FASTA or FASTQ) or the sample id but with the file name extension ".csv". It must be placed in the contig or read-file directory. The first row is used as header, the second row contains the data. All other rows are ignored.

Example for a sample CSV-file:

ef.Characteristic.genus,ef.Characteristic.species,Strain,tg.T2,tg.T3
Escherichia,coli,strain 3,yes,yes

Note that tags T2 and T3 will be set for the sample in this example.
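As a rough Python sketch of the single-sample convention, the file name can be derived from the sequence file and only the first data row is read. The function names are hypothetical, and the sketch assumes a sequence file with a single extension such as .fasta or .fastq; how compressed or double extensions are handled is not stated here.

```python
import csv
from pathlib import Path


def single_sample_csv_path(sequence_file):
    """Return the CSV path next to a sequence file (single extension assumed)."""
    return Path(sequence_file).with_suffix(".csv")


def read_single_sample_csv(text, delimiter=","):
    """Map header names to the values of the second row; later rows are ignored."""
    rows = list(csv.reader(text.splitlines(), delimiter=delimiter))
    if len(rows) < 2:
        return {}
    header, data = rows[0], rows[1]
    return dict(zip(header, data))


SAMPLE_CSV = (
    "ef.Characteristic.genus,ef.Characteristic.species,Strain,tg.T2,tg.T3\n"
    "Escherichia,coli,strain 3,yes,yes\n"
    "this,row,is,ignored,completely\n"
)

record = read_single_sample_csv(SAMPLE_CSV)
```

Here record contains the values of the second row, and the third row has no effect.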

SPEC-files

SPEC files can also be used in SeqSphere+ to export and import metadata.

The SPEC file for a sample must have the same name as the input sequence file (e.g., FASTA or FASTQ) but with the file name extension ".spec". If a specific file naming scheme is used in a pipeline, the SPEC file may instead be named after the sample ID. Additionally, a single SPEC file can be defined for all sequence files in its directory if it is named "sequence_specification.spec". If multiple SPEC files are found for a sample, they are merged.

The content of a SPEC file is plain text (UTF-8) where each line holds a single field and value pair in the format field=value (e.g., pf.avg._coverage_(assembled)=111). The fields may be in any order. The fields listed in the sections below can be set in a SPEC file and are imported as metadata. Dates can be given as yyyy-MM-dd.
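A minimal Python sketch of reading the field=value format and merging several SPEC files is given below. It is not taken from SeqSphere+; the names parse_spec and merge_specs are hypothetical. Skipping blank or malformed lines, and letting a later file take precedence when files are merged, are assumptions, since the merge order is not specified here.

```python
def parse_spec(text):
    """Parse SPEC content: one field=value pair per line, in any order."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue  # assumption: blank or malformed lines are skipped
        # Split on the first '=' only, so values may contain '=' themselves.
        name, value = line.split("=", 1)
        fields[name.strip()] = value.strip()
    return fields


def merge_specs(*spec_texts):
    """Merge several SPEC contents into one mapping.

    Assumption: when a field occurs in more than one file, the later file wins.
    """
    merged = {}
    for text in spec_texts:
        merged.update(parse_spec(text))
    return merged


directory_spec = "pf.assembler=MIRA\nef.Characteristic.genus=Escherichia\n"
sample_spec = "pf.avg._coverage_(assembled)=111\nef.Sample.isolationDate=2020-03-05\n"

metadata = merge_specs(directory_spec, sample_spec)
```

In this example, metadata combines the two fields from the directory-wide file with the two from the sample-specific file.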

Epi Metadata Fields

The following names can be used to specify epi metadata fields in CSV or SPEC-files:

ef.Sample.alias_id
ef.Sample.isolationDate
ef.Sample.receiptDate
ef.Sample.sample_id_of_collector
ef.Sample.sender
ef.Sample.comment
ef.Sample.modifiedDate
ef.Sample.createdDate
ef.Sample.submittedDate
ef.Sample.downloaded_from
ef.Sample.submitted_to
ef.Source.source_type
ef.Source.source_subtype
ef.Source.host
ef.Source.host_age
ef.Source.host_sex
ef.Source.host_disease
ef.Source.isolation_source
ef.Source.isolation_country
ef.Source.isolation_state
ef.Source.isolation_city
ef.Source.isolation_zip
ef.Source.isolation_lat_long
ef.Source.lat_long_resolution
ef.Source.cluster_outbreak
ef.Source.epi_info
ef.Source.case_id
ef.Source.ecdc_case_id
ef.Characteristic.genus
ef.Characteristic.species
ef.Characteristic.subspecies
ef.Characteristic.strain
ef.Characteristic.genotype
ef.Characteristic.serotype
ef.Characteristic.pathotype
ef.Characteristic.identification_method
ef.Characteristic.identification_kit_vendor
ef.Characteristic.culture_collection
ef.Characteristic.pubmed_id
ef.Characteristic.study
ef.Characteristic.ncbi_accession
ef.Characteristic.experiment_accession
ef.Characteristic.sample_accession
ef.Characteristic.study_accession
ef.Report.report_comment

Procedure Details and Statistics Fields

pf.library_source
pf.library_strategy
pf.sequencing_protocol
pf.sequencing_vendor
pf.assembly_pre-processing
pf.assembly_type
pf.assembler
pf.assembler_version
pf.assembler_parameters
pf.assembly_post-processing
pf.expected_genome_size_for_downsampling
pf.downsampled_to_coverage
pf.top_species_match
pf.top_species_match_identity
pf.top_species_match_shared-hashes
pf.contamination_check_result
pf.fastqc_per_base_sequence_quality_(forward_reads)
pf.fastqc_per_base_sequence_quality_(reverse_reads)
pf.fastqc_adapter_content
pf.avg._coverage_(unassembled)
pf.avg._coverage_(processed,_unassembled)
pf.avg._read_length_(unassembled)
pf.avg._read_length_(processed,_unassembled)
pf.read_count_(unassembled)
pf.read_count_(processed,_unassembled)
pf.read_base_count_(unassembled)
pf.read_base_count_(processed,_unassembled)
pf.contig_count_(assembled)
pf.n50_(assembled)
pf.read_count_(assembled)
pf.read_fwd_count_(assembled)
pf.read_rev_count_(assembled)
pf.consensus_base_count_(assembled)
pf.approximated_genome_size_(mbases)
pf.max_contig_length_(assembled)
pf.min_contig_length_(assembled)
pf.avg._contig_length_(assembled)
pf.avg._coverage_(assembled)
pf.read_base_count_(assembled)

Paths to Raw Read Files Fields

fl.reads.1
fl.reads.2
