Making Sense of Big Data: Standards in Genomic Sciences


Nearly 20 years into the era of full genome sequencing, the cost of sequencing a single genome, once a feat reserved for the well-funded, is ceasing to be a matter of discussion.

Sequencing is merely data gathering for the biology. With this change, journals too have changed. Looking to publish scientific breakthroughs, many journals have stopped considering the sequencing of a whole genome a scientific advancement, or work worthy of publication.

But as the cost of sequencing drops, the amount of data in public databases rapidly accumulates. And as with all big data, its great potential lies in its size. We need to be able to view genomes in comparison. Yet there is no contextualisation for this data: what methods were used to produce and annotate the sequence, information on the source organism and its habitat, its phenotypic features, and genome statistics and details. This deluge of data needs more standardised metadata—from which true insight will come.

With this in mind, the Genomic Standards Consortium (GSC) started the journal Standards in Genomic Sciences (SIGS). The Consortium aims to contextualise genomic data and the reporting of it in a systematic way. But adding such metadata isn’t easy. It takes that rare substance called time. And who wants to do something for nothing? Standards in Genomic Sciences allows authors to contextualise their data and receive credit for it in the form of a Short Genome Report. The journal itself is a true data resource for human and machine readers alike.

Back in 2011, Jonathan Eisen, in his well-read blog Tree of Life, commented on this need for better metadata in the face of cheaper sequencing:

“When I first heard about these standards developments, I was bored almost to tears.  But now I realize that this is a very important aspect of getting the most out of genome data.  If people who sequence a genome not only release the sequence data, but also a table of information about the project, such as information about the organism (e.g., aerobic vs anaerobic, location of isolation) and about the data production (e.g., sequencing methods used) then people will be able to do high throughput analyses of these features.  Then we will not just be looking at sequence but also connecting these sequences to organismal features.  Right now that is very hard to do since genome data is rarely accompanied by machine usable information about the organism that has been sequenced.”
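To make the idea of "machine usable information" concrete, here is a minimal sketch in Python of what such a table might look like and the kind of question it lets you ask. The record type, its field names, and the sample values are all hypothetical placeholders chosen for illustration; they are not the GSC's actual checklist or any real genome project.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class GenomeRecord:
    """Hypothetical metadata record; field names are illustrative, not a GSC schema."""
    organism: str
    oxygen_requirement: str   # e.g. "aerobic" or "anaerobic"
    isolation_location: str   # where the organism was isolated
    sequencing_method: str    # how the sequence data was produced


def count_by(records, field_name):
    """Count how many records share each value of one metadata field."""
    return Counter(getattr(r, field_name) for r in records)


# Made-up placeholder records, not real genome projects.
records = [
    GenomeRecord("organism A", "aerobic", "soil", "method X"),
    GenomeRecord("organism B", "anaerobic", "gut", "method X"),
    GenomeRecord("organism C", "anaerobic", "sediment", "method Y"),
]

# Select genomes by an organismal feature rather than by sequence.
anaerobes = [r.organism for r in records if r.oxygen_requirement == "anaerobic"]

# Summarise how the data was produced across the whole collection.
method_counts = count_by(records, "sequencing_method")
```

With consistent fields like these attached to every genome, the same two lines of filtering and counting would work across thousands of records, which is the kind of high-throughput comparison the quote describes.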

Five years on in 2014, this experiment called SIGS appears to be a success. The journal now has an Impact Factor of 3.17 and publishes around 120 articles a year. Today, we welcome SIGS onto the BioMed Central platform, where we become its publishing partner. You can read more about SIGS in our launch editorial by Dr George Garrity.

We launch with several interesting Short Genome Reports, including the genome of Escherichia coli. E. coli is perhaps one of the most important model organisms in all of biology. While the genomes of E. coli K12 and other lab strains have been sequenced, that of the type strain, on which the species (and genus) are based, had not been until now.

Readers can explore this important paper to discover what differences exist. Our launch articles also include a Short Genome Report of Ensifer medicae, an important nitrogen-fixing microsymbiont. Another article of interest is the Hospital microbiome report, a meeting report describing the change in the microbiome of a newly opened hospital.

We encourage readers to explore all 20 launch articles at the new Standards in Genomic Sciences platform and to sign up for article alerts to the journal. Early in 2015 we also hope to add further functionality to help link data to the literature, furthering the aim of SIGS. To submit your own Short Genome Report, please explore the author template on the journal website. We welcome your feedback and questions.
