Report from the Publishing Open Data Working Group meeting, 17th June 2011

On 17th June BioMed Central held a Publishing Open Data Working Group meeting, proposed in the spring, in London, UK. This post is a summary report from the meeting, including the next steps for the stakeholders involved. The meeting has also been reported by Alex Ball on the Digital Curation Centre blog. Many thanks to all the attendees acknowledged below for their contributions. While an important reason for convening the meeting was to stimulate debate amongst authors, editors, publishers, funders and librarians, it’s excellent to report that there are a number of mutually agreeable ways forward on all three of the meeting’s proposed goals. Please note that the actions and views stated do not necessarily represent the views of all attendees.

Goal 1: Establish a process and policy for implementing a variable publishers’/authors’ license agreement, allowing public domain dedication of data and data elements of scientific articles

A common misconception about implementing Creative Commons CC0 for data published within or alongside scientific articles seems to have been that it applies to all scientists’ data, not just those submitted to a journal. This goal pertains only to content which researchers already publish. By implementing a variable license agreement (with CC0 for data and a Creative Commons Attribution license for creative and written works), we would be asking authors only to apply different terms of use to some parts of what they already publish. Journal and publisher policies for the availability of all (that is, including unpublished) underlying data are important, but are a distinct issue, discussed as part of Goal 3.

The International Stroke Trial database, published by Sandercock et al. in Trials in April 2011, for example, includes a brief article describing a large clinical dataset, and the dataset is an accompanying CSV file. With a variable license agreement, the data (the CSV file) would be available for reuse without a legal requirement for attribution. Scientific norms of citation would still apply and, for any future aggregated use in, for example, a systematic review and meta-analysis, it would undoubtedly be scientifically (culturally) essential for the source data to be cited – even if not legally required – to ensure credibility.

The consensus of the group was that a feasible approach to the implementation of a variable license agreement would be to specify that, from a specific date, any author submitting to a journal/publisher agrees to dedicate the data elements of their article and supplementary material to the public domain. This was a key proposal in the BioMed Central Open Data statement, but the group felt that much more detail on the process, policy and implications of variable licensing for published articles and additional (supplementary) files is needed to enable implementation. It was agreed that the points to consider, in detail, should include:

●    The safeguards needed, if any, to ensure appropriate credit is given in the absence of legal requirements for attribution
●    The legal and technical developments needed to support and enable change
●    How to demonstrate the value of public domain dedication to different stakeholders
●    The frequently asked questions and concerns about public domain dedication
●    What do we mean by data?

This last point in particular generated much discussion, with differing views on the need to define data, and, if so, the granularity required. One approach would be to provide no explicit definition of data and see how it is interpreted by the community, as arguably those who will be (re)using open data at scale will already understand licensing issues. But not all data are equal; data can arguably be anything from individual data points collected during a clinical trial, to descriptive methods. While it will not be possible to encompass all data types, the consensus was that some basic guidance and specific examples could be useful for scientists. For example, publishers often receive additional data tables as PDF files, which would be much more reusable if presented as CSV files.

BioMed Central has agreed to lead on drafting this second open data white paper, which will be made available for a defined period of public consultation.

Storage of data – where, what and for how long – was a common concern of the group. In some data-intensive fields, only the most relevant of the many terabytes of data being collected are retained. But some data, such as those generated during large clinical trials, can never be recollected. And these more ephemeral data would, arguably, benefit most from data sharing under more liberal public domain terms, as this would ensure maximum potential for (re)use. There was agreement that funders and scientists should consider these issues in preparing data sharing and management plans, which are increasingly being required by funding agencies. Also, the group agreed to work together to identify more areas of science where there are no obvious places and formats to store data, for example by reviewing resources such as the DataCite/BioMed Central repository list and the BioSharing list of data standards.

Goal 2: Consensus on the role of peer reviewers in articles including supplementary (additional) data files

Peer reviewer guidance on the websites of all publishers represented by the group currently includes little or no information about reviewing additional files. One approach would be for additional datasets to merely be considered as an additional reference. But where a dataset is the primary purpose of a publication, such as in the example by Sandercock et al., the approach may need to be different, perhaps more towards integration with Dryad or similar services. Moreover, would publishing an article with a dataset that later turned out to be unreliable reflect badly on the peer reviewers?

While the consensus was that detailed guidance on how myriad data formats should be reviewed may not be necessary, it was agreed that more information should be added to peer reviewer guidelines of any journals that publish supplementary (additional) files, explaining the editors’ expectations of the role of peer reviewers regarding additional data files. Here, the instructions for reviewers for specific data types – e.g. in Trials and at Pensoft – may serve as a guideline.

Of the numerous expressions of interest in this working group meeting, the suggestions from Dr Egon Willighagen on the presentation of additional files were particularly relevant for discoverability and reuse of data files. Although BioMed Central has previously offered a searchable list of published data files to any scientists wishing to use them, there is more that can be done by publishers, and we are working towards additional features such as exporting of data from article tables in alternative formats.

Goal 3: Sharing of information and best practices on implementation of journal data sharing/deposition policies

In addition to the three different approaches to data sharing policies and statements described in the meeting agenda, alternative and combined approaches were also identified – for example the journal Biostatistics’ kite marking of articles depending on their level of reproducibility, and the ‘Availability of supporting data’ section being implemented by some BioMed Central journals.

Data sharing statements in peer-reviewed journals were considered to be a positive development, and it seems they will become an increasingly common requirement.

The group agreed on the need to better identify published articles where supporting data are freely available, for example in electronic tables of contents. But it is important to recognise that not all data can be openly shared, so publications without this additional data functionality, when defined, should not necessarily be viewed differently by readers. Indeed, it was agreed that further studies on the impact of journal data sharing policies and statements in published articles, as have been carried out for competing interests, are needed to add further evidence for journal data sharing policies.

The role of journals and funders in the enforcement of data sharing policies was also discussed. Journals with mandates for data sharing in published articles have quite clear powers in ensuring consistent application in published articles. Funders, it emerged, are aiming to enforce data sharing policies more stringently.

Further involvement of institutions in encouraging and requiring data sharing was raised. Edinburgh University has implemented a data sharing policy (and repository), and better knowledge of these policies would be valuable for those wishing to obtain data. It was agreed that a logical extension to the BioSharing list of data sharing policies would be to include institutions, and the Digital Curation Centre is also collating them. All are encouraged to contribute to these resources. BioMed Central will also now augment its resources on funders and institutions with policies on open access, to include information on data sharing policies.

Concluding remarks

Support for access to research data continues to grow, again recently at the UK Government peer review inquiry, so it’s vital that barriers to sharing are overcome. There is a lot of excellent work already happening to promote and provide further evidence for the benefits of open data in science, but this group was convened to build shared understanding on three issues not currently being addressed. Furthermore, this meeting aimed to bring together different perspectives on open data, positive and negative, which is often hard to achieve in open science circles, and it managed to do this with some success. We’re pleased that a number of BioMed Central journal editors, not all of whom could attend the meeting, have been in contact about the data debate from fields as diverse as biomedical physics and clinical trials. Indeed, attitudes to proposals from the meeting may well differ between different scientific fields, and this was another important consideration to emerge from the meeting. Publishers serving a broad section of scientists are in an excellent position to share knowledge across domain boundaries, and any scientists wishing to express their views and collaborate on these initiatives are encouraged to contact BioMed Central.


Many thanks to the meeting attendees for their participation:
Alex Ball (UKOLN), Theo Bloom (Public Library of Science), Diane Cabell (Oxford Internet Institute), David Carr (Wellcome Trust), Matt Cockerill (BioMed Central), Clare Garvey (Genome Biology), Trish Groves (BMJ), Michael Jubb (Research Information Network), Rebecca Lawrence (F1000), Daniel Mietchen (EvoMRI Communications), Elizabeth Moylan (BioMed Central), Cameron Neylon (Science and Technology Facilities Council), Elizabeth Newbold (British Library), Susanna Sansone (University of Oxford), Tim Stevenson (BioMed Central), Victoria Stodden (Columbia University), Angus Whyte (Digital Curation Centre) and Ruth Wilson (Nature Publishing Group).
