Biodiversity Information Science and Standards : Conference Abstract
Corresponding author: Andra Waagmeester (andra@micelio.be)
Received: 05 Apr 2019 | Published: 18 Jun 2019
© 2019 Andra Waagmeester, Lynn Schriml, Andrew Su
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation: Waagmeester A, Schriml L, Su A (2019) Wikidata as a linked-data hub for Biodiversity data . Biodiversity Information Science and Standards 3: e35206. https://doi.org/10.3897/biss.3.35206
Wikidata (http://www.wikidata.org) is the linked database of the Wikimedia Foundation. Like its sister project Wikipedia, it is open to both humans and machines. Initially intended primarily as a central repository of structured data for the approximately 200 language versions of Wikipedia, Wikidata now also serves many other use cases. It is an open, Semantic Web-compatible database that anyone can edit.
Here, we present the Gene Wiki initiative, a project that started in 2008 by creating Wikipedia articles for all human genes (
This structured information is added to Wikidata, while active links to the primary source are maintained. Adding a new resource to Wikidata is a community-driven process that starts with modelling the subjects of the resource under scrutiny. This involves seeking commonalities with similar concepts already in Wikidata and, if none are found, creating new ones. This process mostly happens in a collaboratively edited document (e.g. Google Docs), where graphical networks are drawn to reflect the data being modelled and its embedding in Wikidata. Once consensus has been reached, the model typically exists as a human-readable document. To allow future validation of these models against existing data, it is converted into a machine-readable Shape Expression (ShEx) (
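The validation idea behind such a schema can be illustrated with a minimal Python sketch. This is not ShEx tooling itself but a toy, ShEx-like checker: the property IDs (P31 "instance of", P703 "found in taxon", Q7187 "gene") are real Wikidata identifiers, while the shape format and the `conforms` function are illustrative assumptions.

```python
# Hypothetical sketch: checking Wikidata-style statements against a
# minimal, ShEx-like shape. The shape maps a property ID to a pair
# (minimum cardinality, set of allowed values or None for "any value").

def conforms(item_claims, shape):
    """Return True if every property constraint in `shape` is satisfied."""
    for prop, (min_count, allowed) in shape.items():
        values = item_claims.get(prop, [])
        if len(values) < min_count:
            return False
        if allowed is not None and not all(v in allowed for v in values):
            return False
    return True

# A toy "human gene" shape: the item must be an instance of gene (Q7187)
# and must state at least one taxon it is found in (P703).
gene_shape = {
    "P31": (1, {"Q7187"}),   # instance of: gene
    "P703": (1, None),       # found in taxon: at least one value
}

reelin = {"P31": ["Q7187"], "P703": ["Q15978631"]}  # Homo sapiens
print(conforms(reelin, gene_shape))  # True
```

A real validation run would instead apply a published ShEx schema to the RDF exported for each Wikidata item, but the pass/fail logic is of this shape-per-class form.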
Once a semantic data model (as a Shape Expression) has been agreed upon, i.e. community consensus is reached, a bot is developed to convert the knowledge from the primary source into the Wikidata model. While Wikidata is linked data (part of the Semantic Web), many life-science resources are not. On the contrary, many distinct file formats or API output formats are used to present life-science knowledge. To convert between these formats, bots need to be developed that can parse the different resources and serialize them into Wikidata. We have developed a software library in the Python programming language, which we use to build these bots. Once created, these bots run regularly to keep Wikidata up to date with knowledge on genes, proteins, diseases and drugs.
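The parse-and-serialize step such a bot performs can be sketched as follows. This is a simplified illustration, not the actual library: the tab-separated input format is invented, while P351 (Entrez Gene ID) and the claim/snak JSON structure follow Wikidata's real data model.

```python
# Hypothetical sketch of a bot's conversion step: read one record from a
# (fictional) tab-separated source and serialize it into Wikidata's JSON
# claim structure. A full bot would also set labels, references, and
# qualifiers, and write the result via the Wikidata API.
import json

def record_to_claim(line):
    """Turn an 'entrez_id<TAB>symbol' record into a Wikidata-style claim."""
    entrez_id, symbol = line.rstrip("\n").split("\t")
    # `symbol` would feed the item label in a full bot; unused here.
    return {
        "mainsnak": {
            "snaktype": "value",
            "property": "P351",  # Entrez Gene ID
            "datavalue": {"value": entrez_id, "type": "string"},
        },
        "type": "statement",
        "rank": "normal",
    }

claim = record_to_claim("5649\tRELN")
print(json.dumps(claim, indent=2))
```

Because every source format reduces to the same claim structure, the same downstream write-and-reference logic can be shared across all bots.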
Having scientific knowledge represented in Wikidata comes with benefits. First, having research data on Wikidata increases its sustainability: when research projects end, their findings remain on an independently funded infrastructure. Having someone else maintain the infrastructure for a data commons also relieves the research community of having to do so themselves, leaving more time to focus on research.
As a generic public data commons, Wikidata allows public scrutiny and rapid integration with other domains. Inconsistencies or disagreements between resources become more visible thanks to the unified data models and interfaces.
We leverage the latter as a feature in our bots. One of our core resources, for example, is the Disease Ontology (
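The discrepancy-detection idea can be sketched in a few lines of Python. Both datasets below are invented for illustration; in practice the comparison would run over the full ontology release and the corresponding Wikidata items, and flagged differences would be routed to human curators.

```python
# Hypothetical sketch of a round-trip consistency check: compare disease
# labels from a primary source (a stand-in for the Disease Ontology here)
# with those held in Wikidata, and report discrepancies for curation.

def find_discrepancies(source, wikidata):
    """Return a dict of IDs whose labels differ or are missing in Wikidata."""
    diffs = {}
    for doid, label in source.items():
        wd_label = wikidata.get(doid)
        if wd_label is None:
            diffs[doid] = ("missing in Wikidata", label)
        elif wd_label != label:
            diffs[doid] = (wd_label, label)
    return diffs

# Invented example data (DOID identifiers are real-format IDs).
disease_ontology = {
    "DOID:1612": "breast cancer",
    "DOID:9352": "type 2 diabetes mellitus",
}
wikidata_labels = {"DOID:1612": "breast carcinoma"}

for doid, (wd, src) in find_discrepancies(disease_ontology, wikidata_labels).items():
    print(f"{doid}: Wikidata={wd!r} vs source={src!r}")
```

Differences detected this way can flow in both directions: corrections made by the Wikidata community can be reported back to the primary resource, and updates in the resource propagate to Wikidata on the next bot run.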
Although our bots focus on molecular biology, our approaches are generic enough that we are confident a similar approach can work in biodiversity informatics.
wikidata, wikipedia, molecular biology, crowd-sourcing
Andra Waagmeester
National Institutes of General Medical Sciences R01GM089820