DFW and FAIR Agriculture Data: the KnetMiner Experience
Those who really enjoy personal computers, and who have the skills, may have chosen to assemble their own PC. Instead of buying an off-the-shelf machine, one can spend days choosing the best components, from powerful CPUs to fast hard disks, and then more time putting them together, “assembling” the perfect PC. That experience is made possible by something that affects many human activities: standards and interoperability.
Standardised components are interoperable: one can buy compatible PC components from completely different manufacturers and simply plug one into another. Even more flexibly, developers can write software on their PC and be sure that it will run unchanged on a large variety of compatible (ie, standardised) hardware.

Standards are well known in agriculture too, from simple hose connectors like Hozelock’s, to complex communication protocols for farming machinery like ISOBUS. Adopting data standards in areas like plant biology, farming, or food is therefore a natural extension. On the software side, agriculture ontologies, AGROVOC, and the BrAPI interface are a few examples of data standardisation efforts in this field.
Digital Standards in Agriculture (and DFW)
As a large cross-institute research programme, Designing Future Wheat provides a natural opportunity for data standardisation. Browsing the data resources that we are making available, one can see several examples where interoperability allows these resources to be integrated more effectively.
For instance, the Grassroots system is being used to offer a set of federated data services, and it integrates with WheatIS, a similar project that links data resources in a lightweight way. Essentially, this is based on the combination of three things: indexing technology (Elasticsearch), common terminology, and a common programmatic interface (API). In turn, WheatIS integrates BrAPI, a well-known API for accessing breeding experimental data. Our KnetMiner is another resource that participates in the WheatIS federation, through the GnpIS project.
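To give a flavour of what a common programmatic interface buys you, here is a minimal sketch of talking to a BrAPI v2 server from Python. The base URL is a placeholder, not a real DFW service; any server implementing BrAPI v2 answers the same standard call.

```python
# A minimal sketch of querying a BrAPI v2 server; the base URL below
# is hypothetical, swap in any BrAPI-compliant endpoint.
import requests

BASE_URL = "https://example.org/brapi/v2"  # hypothetical server

# /serverinfo is the standard BrAPI call listing what the server supports
resp = requests.get(f"{BASE_URL}/serverinfo", timeout=30)
resp.raise_for_status()

for call in resp.json()["result"]["calls"]:
    print(call["service"], call["methods"])
```

Because the call and its response shape are standardised, the same script works against any compliant server, which is exactly the interoperability point made above.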
As another example of data integration, several DFW resources are aligned to the RefSeq wheat genome assembly, which is part of the Ensembl resources. This assembly is used in Grassroots, by CerealsDB, the Bristol University resource for wheat genomics, and by our KnetMiner data sets. Referring to such a common genome assembly makes specific data set integrations easier, such as the project to link QTL markers with genome and phenotype data, which involves CerealsDB and Grassroots.
Many projects like those above start by collecting plain data files, transforming them, and building applications on top of them. That’s one reason why it’s important to publish data dumps too, as DFW has done with its data portal, based on the very popular CKAN software.
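A standard portal platform helps here too: every CKAN instance exposes the same action API, so a sketch like the following (with a placeholder portal URL) works against the DFW portal or any other CKAN site.

```python
# A minimal sketch of searching a CKAN portal's catalogue; the portal
# URL is a placeholder for any CKAN instance, such as the DFW data portal.
import requests

PORTAL = "https://example.org"  # hypothetical CKAN site

resp = requests.get(
    f"{PORTAL}/api/3/action/package_search",
    params={"q": "wheat", "rows": 5},
    timeout=30,
)
resp.raise_for_status()

for dataset in resp.json()["result"]["results"]:
    print(dataset["name"], "-", dataset.get("title", ""))
```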
Meet the KnetMiner Data Endpoints
Using KnetMiner, we build an application from a variety of plant biology-related files and offer knowledge exploration functionality on top of them. We can do even more: a while ago we fully realised that the kind of data integration we do in practice consists of creating knowledge graphs, ie, large collections of interrelated pieces of knowledge that explicitly represent both knowledge structure (eg, properties like a gene name or an article’s title) and knowledge connections (eg, the protein encoded by a gene, or the experiment an article is about). The natural next step was to publish those knowledge graphs using the most common standards for doing so.
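To make the idea concrete, here is a toy sketch (not KnetMiner’s actual internal format) of what “structure plus connections” means in practice: nodes carry properties, and the relations between them are data too.

```python
# A toy illustration of the knowledge-graph idea: nodes have properties
# (knowledge structure), edges express connections. All values are made up.
gene = {"type": "Gene", "name": "EXAMPLE-GENE-1"}           # hypothetical
protein = {"type": "Protein", "name": "EXAMPLE-PROTEIN-1"}  # hypothetical
article = {"type": "Publication", "title": "A made-up wheat study"}

edges = [
    (gene, "encodes", protein),   # the protein encoded by a gene
    (article, "mentions", gene),  # the knowledge an article is about
]

for subject, relation, obj in edges:
    print(f"{subject['type']} --{relation}--> {obj['type']}")
```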
This brings us to the new KnetMiner data endpoints. There are two types. The first uses Semantic Web technologies and the linked data approach to provide a SPARQL endpoint. In practical terms, that means you have an SQL-like language for querying graphs of knowledge, so you can explore the same data that powers the KnetMiner application in different, flexible, and creative ways. We have described this in more detail in this other post, including Jupyter/Python examples of what you can do with it as a developer or analyst; a query looks roughly like the sketch below.
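A minimal sketch using the SPARQLWrapper library follows; the endpoint URL and the bk: vocabulary are written from memory and should be treated as assumptions, with the linked post and notebooks giving the definitive prefixes.

```python
# A minimal sketch of a SPARQL query against the KnetMiner endpoint.
# Endpoint URL and vocabulary prefix are assumptions; see the linked
# post/notebooks for the real ones.
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint = SPARQLWrapper("https://knetminer.org/ws/sparql")  # assumed URL
endpoint.setQuery("""
    PREFIX bk: <http://knetminer.org/data/rdf/terms/biokno/>
    SELECT ?gene ?protein WHERE {
      ?gene a bk:Gene ;
            bk:encodes ?protein .
    }
    LIMIT 10
""")
endpoint.setReturnFormat(JSON)

for row in endpoint.query().convert()["results"]["bindings"]:
    print(row["gene"]["value"], "encodes", row["protein"]["value"])
```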
Because linked data isn’t the best solution for every data access problem (and has not become hugely popular so far…), we have also exported the same data to Neo4j instances, which give access to the Cypher query language. SPARQL and Cypher have complementary pros and cons. For example, SPARQL is particularly suited to exploring and manipulating standardised data published on the web, while Cypher has a very compact syntax on top of easy-to-set-up tools. Moreover, the Cypher/Neo4j model, the so-called property graph model, makes it easy to represent “contextualised” relations, as in “this gene is related to this disease with p-value 1e-10”.
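Here is the same style of query as a sketch in Cypher, via the official Neo4j Python driver; the server address, relationship type, and property names are illustrative assumptions, but they show how the p-value lives on the relation itself.

```python
# A minimal Cypher sketch via the Neo4j Python driver. Server address,
# relationship type and property names are assumptions for illustration.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://example.org:7687", auth=None)  # hypothetical

cypher = """
    MATCH (g:Gene)-[r:related_to]->(d:Disease)
    WHERE r.pvalue < 1e-9   // the relation itself carries its context
    RETURN g.name AS gene, d.name AS disease, r.pvalue AS pvalue
    LIMIT 10
"""
with driver.session() as session:
    for record in session.run(cypher):
        print(record["gene"], record["disease"], record["pvalue"])

driver.close()
```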
Interoperability, again
For these reasons, we offer practically the same data via these two different technologies. This takes us back to the interoperability idea: multiple data formats and data access technologies can be used in a uniform way, because everything is based on common models and schemas. In short, using common names like ‘Gene’ or ‘encodes’ across the different access points is what makes that interoperability possible, as the paired queries below illustrate.
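The same conceptual question reads almost identically across the two endpoints, precisely because the names are shared. Prefixes and label spellings in this sketch are illustrative, not authoritative.

```python
# The same question, phrased for each endpoint; note the shared
# vocabulary ('Gene', 'encodes'). Spellings are illustrative.
sparql_query = """
    SELECT ?gene ?protein
    WHERE { ?gene a bk:Gene ; bk:encodes ?protein . }
"""
cypher_query = """
    MATCH (gene:Gene)-[:encodes]->(protein:Protein)
    RETURN gene, protein
"""
```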
We are achieving this in KnetMiner in a few different ways. Firstly, our data comes from existing resources like those mentioned above (eg, Ensembl, UniProt, PubMed), including common and complex life science ontology annotations (eg, Gene Ontology, Plant Ontology). Secondly, we are mapping our simple knowledge graph schema to a number of common schemas and ontologies. And thirdly, we are working on the Agrischemas project.
Integrating agricultural data: Agrischemas
Agrischemas was born from the need to unify various DFW and agriculture data under a common representation. The motivation for it is twofold. First, in this field (no pun intended!), we have to deal with a variety of schemas, models, and ontologies dedicated to different aspects of agriculture and plant biology research, such as field trial experiments, biological entities, literature, and meteorology. However, little exists that tries to join these different aspects together, to allow for queries like “tell me which genes are expressed in field trials in rainy areas”.
Second, we see a need to complement advanced, ontology-based data modelling with something much simpler and more lightweight. Others have had the same idea: some years ago a number of organisations, especially search service providers, got together and created the schema.org project, a general data schema focused on annotating web pages with useful metadata. This helps systems like search engines to “understand” what a page is about; a typical annotation looks like the sketch below.
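Here is a hedged sketch of such an annotation, built in Python and serialised as JSON-LD, the format schema.org markup is usually published in. All the values are made up for illustration.

```python
# A minimal sketch of a schema.org annotation, serialised as JSON-LD.
# Embedded in a web page, it tells a search engine that the page
# describes a dataset. All values below are made up.
import json

annotation = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Example wheat field trial data",
    "description": "Yield measurements from a hypothetical field trial.",
    "license": "https://creativecommons.org/licenses/by/4.0/",
}
print(json.dumps(annotation, indent=2))
```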
Schema.org goes beyond that original focus: the result is also useful for describing data sets like the KnetMiner knowledge graphs, where the emphasis is on exploratory research rather than on second-stage, more refined analyses such as ontology enrichment or reasoning-based data classification.
Indeed, this is so relevant to the life science domain that a specific project to extend schema.org to it started a couple of years ago: Bioschemas.
We are developing Agrischemas as a further extension of Bioschemas and schema.org, reusing as much of what already exists as possible. We have been doing this in a rather bottom-up way, taking several use cases that emerged (mainly) in DFW and generalising from them.
What have we gathered so far? Along with the draft in our github repository, we have uploaded real-data examples to our SPARQL endpoint, describing plant gene expression experiments as found in the EBI Gene Expression Atlas; the sketch below shows the kind of exploration this enables.
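In this sketch, the endpoint URL and the use of schema:Dataset as the study type are assumptions; the github draft documents the vocabulary actually used.

```python
# A sketch of browsing the Agrischemas example data. Endpoint URL and
# the schema:Dataset typing are assumptions; check the github draft for
# the vocabulary actually used.
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint = SPARQLWrapper("https://knetminer.org/ws/sparql")  # assumed URL
endpoint.setQuery("""
    PREFIX schema: <http://schema.org/>
    SELECT ?study ?name WHERE {
      ?study a schema:Dataset ;
             schema:name ?name .
    }
    LIMIT 10
""")
endpoint.setReturnFormat(JSON)

for row in endpoint.query().convert()["results"]["bindings"]:
    print(row["name"]["value"])
```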
Should I care?
If you are a bioinformatician or an agricultural application developer, definitely yes! Our Jupyter notebooks show what you can do as a programmer or data analyst.
If you are a data manager who wants to integrate your data with DFW data, or to publish your own data sets, that’s another very good reason to get in touch with us!
Acknowledgements
In addition to DFW as the funder of the work presented above, we are grateful to Robert Davey’s group for providing resources from the CyVerse UK infrastructure and helping with their setup and maintenance.