The rapid growth and availability of large volumes of scientific data is commonly associated with genomics but has an increasingly important role to play in other biological disciplines. One such field that is set to change with the emergence of ‘big data’ is taxonomy. In order to make the most of taxonomic data, a comprehensive infrastructure is needed to compile lists of organism names for ease of access and sharing. Current US and European Union laws state that scientific names of organisms and their compilation into lists lack the creative aspect that would make them subject to copyright. However, for those who compile these lists, the issue of intellectual property rights still lacks clarity and is a topic of heated discussion. David Patterson from Arizona State University, USA, Donat Agosti, President of Plazi, Switzerland, and colleagues tackle this problem in their recent study in BMC Research Notes, where they present a ‘blue list’ of elements they consider to be common components of a taxonomic list. These familiar elements are by definition not subject to copyright as they lack a creative aspect. Here Agosti explains how the blue list can help promote data sharing by clarifying what material is subject to copyright and what is not, as well as how biological disciplines can tackle the emerging problem of managing big data in the digital world.
What led Plazi, a non-profit association supporting and promoting openly accessible digital bio-taxonomic literature, to get involved with the debate on copyright issues in taxonomy?
We think that traditional disciplines, such as taxonomy, will inevitably change with the big data revolution. In that world, researchers will have better access to existing data, so they can proceed more quickly, and their insights will be more authoritative. Along with benefits in speed, efficiency, and quality will come new styles of research and new insights. We need to prepare for this era. At Plazi, we are committed to open access to the content upon which the success of this transformation will depend. It was clear to us that many members of the community did not have a good or consistent understanding of copyright and that this got in the way of open access.
We have a special interest in scientific names of organisms. They have been used in a consistent fashion for over 250 years in documents, ledgers, databases, and other sources of biodiversity data. This makes them irreplaceable in organising biodiversity information. Initiatives such as the Global Names project, which seek to exploit names in information management, need to have access to all names ever published if they are to build a universal names-based indexing infrastructure. Yet we found disagreement and confusion, among those who could provide names, on copyright or other intellectual property rights. Some were willing to make content freely open for re-use, but others were aggressively defensive of their right to set conditions on re-use. Users were uncertain as to whether they could re-use names from other sources, and did not wish to offend colleagues by doing so.
We found that two major reasons for differences of opinion were a lack of awareness of the relevant law, and an inability to interpret the laws within a legal context. An additional factor is that the community is not motivated to build a common infrastructure that can serve all needs; rather, the research paradigm favors stand-alone products over which the scientists (or their institutions) make competitive (non-co-operative) claims.
We addressed these matters in a workshop organised by the Global Names project at Arizona State University, USA, in April 2013. The workshop was attended by data managers, taxonomists, intellectual property lawyers from Europe and the US, and the Creative Commons Foundation. We also invited providers and users of names to make submissions for our consideration, and sought broad input through the Taxacom Listserv list.
What are the key issues that you wanted to address with your study?
Our agenda is to promote open sharing and re-use of scientific data. One reason that is given for not sharing content such as names is that the content is owned by an individual, an institution, or a project. Whether because they believe their work is like a painting, or because they believe in the economic potential of the names, or because they want due recognition for their efforts, many agents claim that their content cannot be re-used because it is subject to copyright. This is surprising because those who compile names and other taxonomic information rely on the efforts of hundreds or thousands of individual taxonomists, or on information published by journals, or material located in museums and herbaria, or on analyses and syntheses by colleagues past and present. The science of charting the world’s biodiversity relies on this huge corpus of prior work. The descriptions of species, extending back over 250 years, are to be found in the scientific literature. Of equal value to the published works that define the use of a name are the physical specimens. These are housed in museums and herbaria, often financed – like the work of the scientists – by public funds.
The claims of copyright, we have found, rely on erroneous perceptions of copyright. Often, copyright is perceived as a kind of reward for intellectual effort. But in legal reality, copyright is a legal instrument that serves to protect individual artistic or literary works against slavish reproduction or other forms of unauthorised re-use. The misunderstanding of the concept of copyright hampers the exchange of taxonomic information. Access to much taxonomic information is impaired by license barriers, threats, paywalls, and complicated and/or impracticable re-use restrictions. Although these impediments mostly lack legal justification, they result in uncertainty that deters re-use and delays scientific progress. We hope that our report will help taxonomists, data managers, and those who want to re-use biodiversity data to better understand the correct use of copyright arguments. This will remove some unnecessary hurdles to scientific collaboration, and will allow biodiversity science to progress into the ‘big data’ world.
Are the issues around intellectual property rights a problem unique to taxonomy or do they apply in other biological disciplines?
This problem is not unique to taxonomy; we can see signs of it to a greater or lesser extent in other sectors of biology. It is particularly damaging in taxonomy as this is the core discipline that provides an organisational infrastructure for biodiversity information. Yet, taxonomists, who are often fearful for their own futures and that of their discipline, would find themselves much more relevant and valued if taxonomic data and information were openly and freely available, and transformed into a virtual infrastructure for the biological sciences.
How can biological disciplines overcome the problems associated with managing ‘big data’?
Progress towards ‘big data biology’ will, in our view, have a modular architecture. Each subdomain will develop its own system of standards for sharing data, and enterprises will develop as nodes that take responsibility for gathering together relevant content, transforming it to a common standard, and connecting it to other proximate domains. The nodes will act as one-stop shops through which end users, whether people or machines, can gain access to the content. Enterprises such as GenBank and the Global Biodiversity Information Facility (GBIF) are good illustrations of this approach. The biodiversity sciences have accepted the need for a names-based infrastructure that contains all names that have ever been used for all taxa and that provides data and services that are freely and openly accessible to everybody who works in biology. The Global Names Architecture has been established to act as a node that will link all the variants of names, from the first, original use controlled by the Codes of Nomenclature, via the many synonyms or misspellings, to the currently accepted name. As it matures, it will allow each name to be linked to the treatment in which the name was first published, to the original publication, to the data upon which the name is based, and to any subsequent information published by anyone.
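The idea of linking every variant of a name to the currently accepted name can be illustrated with a minimal sketch. It is not the Global Names Architecture's actual design or API; the class `NameIndex`, its methods, and the placeholder names "Aus bus", "Aus buus" and "Xus bus" are all hypothetical and chosen only for illustration.

```python
class NameIndex:
    """Hypothetical index mapping name variants to one accepted name."""

    def __init__(self):
        # Maps a case-folded variant to the accepted name it points to.
        self._accepted = {}

    def add(self, accepted, variants=()):
        # The accepted name resolves to itself; every synonym or
        # misspelling resolves to the accepted name.
        for name in (accepted, *variants):
            self._accepted[name.casefold()] = accepted

    def resolve(self, name):
        # Returns the accepted name, or None if the name is unknown.
        return self._accepted.get(name.strip().casefold())


index = NameIndex()
# Placeholder names, not real taxa: one misspelling and one synonym.
index.add("Aus bus", variants=["Aus buus", "Xus bus"])

resolved_synonym = index.resolve("Xus bus")
resolved_misspelling = index.resolve("aus BUUS")
resolved_unknown = index.resolve("Yus cus")
```

A real infrastructure would also need to record where each name was first published and handle names whose accepted status changes over time, which this sketch leaves out.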
Progress will not happen consistently across the full spectrum of biology, but will advance and transform depending on need, use cases, champions, investment, and technology. Two other factors are critical. One is to set aside the research paradigm that produces many groups working for short periods on the same topic in favour of an infrastructure paradigm that brings all players, friends and foes, together to support an infrastructure from which they may all benefit. Lastly, there is the need to change the mindset of the scientists so that data is made ready for re-use from the outset, and to have a mechanism in place to ensure credit is given for re-use of content. With a credit mechanism in place, those with data are rewarded by participating in a co-operative ‘big data’ world.
What solutions do you see to ensuring correct and proportionate attribution when sharing and/or re-using biological content in the digital world?
First of all, we must separate the concept of attribution from the concept of copyright. It is an important part of the scientific code of conduct to give credit to the findings and efforts of others. In the digital world, we must ensure that recognition is given to all sources of information. But this obligation is completely independent from the issue of whether the source is protected by copyright or database protection or any other intellectual property right. Mechanisms are beginning to emerge that can be used to assign credit to sources of information and to all of the players who collect and distribute information and on whom we also rely for access to information. But sources need to get most credit. The emerging mechanisms appropriate to the ‘big data’ world include annotation systems such as Filtered Push. They rely on using UUIDs (Universally Unique Identifiers – 128-bit identifiers usually written as a string of 32 hexadecimal digits) for every datum element. Browsers and users would have small software plug-ins that track the input, use, and redistribution of any data elements. Every time there is a transaction, the system can reward all members of the supply chain with a unit of credit, and convey this information to a central registry. This favours players near the base of the supply chain. Data creators gain more credit than those who help to convey information from them to end users. Such a system can also be used to annotate content, enriching it by providing more relevant information or by correcting errors.
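The credit mechanism described above can be sketched in a few lines. This is an illustration under stated assumptions, not how Filtered Push or any existing registry works: the `CreditRegistry` class, its method names, and the contributor labels are hypothetical. It shows why a rule of one unit of credit per supply-chain member per transaction favours data creators.

```python
import uuid
from collections import Counter


class CreditRegistry:
    """Hypothetical central registry that tallies credit per contributor."""

    def __init__(self):
        self.creators = {}        # datum UUID -> creator of that datum
        self.credits = Counter()  # contributor -> accumulated credit units

    def register(self, creator):
        # Every datum element receives its own UUID.
        datum_id = uuid.uuid4()
        self.creators[datum_id] = creator
        return datum_id

    def record_use(self, datum_id, via=()):
        # One transaction: each member of the supply chain, from the
        # creator through any intermediaries, earns one unit of credit.
        chain = [self.creators[datum_id], *via]
        for member in chain:
            self.credits[member] += 1
        return chain


registry = CreditRegistry()
datum = registry.register("museum_collection")  # hypothetical data creator

# Three end-user transactions reaching the same datum by different routes.
registry.record_use(datum, via=["aggregator_a"])
registry.record_use(datum, via=["aggregator_b"])
registry.record_use(datum, via=["aggregator_a", "portal_c"])
```

Because the creator sits at the base of every route to the datum, it is credited in all three transactions, while each intermediary is credited only for the transactions that pass through it.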
Can you briefly explain what the ‘blue list’ you developed is?
The ‘blue list’ is our attempt to identify elements in taxonomic information that, because of their factual nature or because they are familiar components in taxonomic publications, lack the creativity that makes copyright applicable. This list applies to checklists, classifications, taxonomies, and monographs. Despite the intellectual effort required to acquire the data, elements of the blue list are not subject to copyright and may be freely re-used unless restricted by a use agreement. By publicising this list, we hope to remove at least some of the self-imposed barriers to taxonomic information exchange. A complete list of current elements on the blue list can be found in our study in BMC Research Notes.
After its publication, we received several objections to the blue list. The criticisms did not show any flaw in our interpretation of copyright law. Rather, they emphasised the originality and effort required to compile taxonomic information, and suggested that this justified the application of copyright law. This argument is flawed because it does not distinguish data elements from a complete work, nor the creativity of the effort to acquire data from the creativity of presentation in the sense of, and as required by, copyright law. So, a brush stroke in burnt umber in Van Gogh’s ‘Sunflowers’, a B flat in the Eroica Symphony, or the term ‘old sport’ in The Great Gatsby are not subject to copyright considerations. Nor are bits of information such as ‘length range 11-23 µm’, ‘carpels from rose to white’ and ‘spores lacrymariose’. It is this kind of information that is being addressed with the ‘blue list’.
How can the ‘blue list’ help promote data sharing?
This exercise contributes to an under-visited aspect of the big data vision: the discussion of intellectual property. By seeking input from biologists and intellectual property lawyers, we now know that many elements of taxonomic treatments do not qualify as creative work, and that many biologists were previously unaware of this. Our report, we hope, will help to redirect a misguided discussion over legal aspects of information exchange in science, and will create an environment that promotes release and re-use of content.
The Creative Commons license suite, version 4.0, was recently updated to include the addition of sui generis database rights, with the aim of rewarding authors for the work put in to compiling a database. Do you think this may provide a solution to some of the problems around attribution in the field of taxonomic ontologies?
We don’t think that the update is aimed at “rewarding authors for the work put in to compiling a database”. The European database protection, a concept found only in the European Union (EU), was introduced in 1996 and subsequently transposed into national laws in the EU member states. It has an economic justification, which is to improve the return on investments put into a database where “there has been qualitatively and/or quantitatively a substantial investment in either obtaining, verification or presentation of the contents” (Article 7 of Directive 96/9). It protects the investor, not the compiler. The Creative Commons licenses up to version 3.0 did not address issues of database protection. As a result, on certain occasions the Creative Commons licenses allowed use of data which at the same time was forbidden by database protection. Version 4.0 brings the licenses up to date with the legal situation in EU member states. As with all other Creative Commons licenses, they are a form of data use agreement, now addressing not only copyright but also the European sui generis database protection.
As in the case of copyright, we should separate the question of database protection from the concept of attribution. ‘Attribution’ is better seen as part of the scientific code of conduct and is independent of the question of whether a database is protected by the sui generis database right or not. So, our answer to this question is ‘No’. Indeed, it would be very worrisome if this license were called upon to achieve attribution. Rather, those who require attribution should work together to implement an annotation system that can provide credit for the use and re-use of elements of an ontological system, in which each element is identified through a UUID (Universally Unique Identifier).
However, the CC-BY-license 4.0 is helpful because it now clearly states that the owner of a protected database available under CC-BY-license will not claim database rights even if a substantial part of it is extracted and re-used. The more often that this license is applied to databases containing taxonomic information, the easier it will become to build the universal names-based infrastructure that we urgently need in order to organise biodiversity information.
Questions from Elizabeth Moylan, Biology Editor for BioMed Central.