Exploring historic plant biodiversity in the Lake District: the role of digital technologies in extracting knowledge from historic texts
In June 2018 Ensemble opened a call to fund projects relating to biodiversity and digital technology. We funded four projects and have asked each team to share a bit about their work. Below is the report from the project entitled “Exploring historic plant biodiversity in the Lake District”. They explored the use of computational methods to resolve challenges in accessing plant data collected over several centuries, which have been provided from different sources and in various formats.
For over 250 years the English Lake District has been considered a place of outstanding cultural and environmental significance, attracting writers, poets, artists, naturalists, political commentators and tourists alike. In direct response to its sustained renown, the Lake District was designated a UNESCO World Heritage Site in 2017 under the category of a ‘cultural landscape’ (The Operational Guidelines for the Implementation of the World Heritage Convention. Retrieved August 20, 2009).
The UNESCO award has served to heighten public interest in the region. It has also brought into sharp focus the region’s environmental past and how the natural landscape has changed over time, as well as how people have responded to this change. These questions have become of pressing concern for heritage and conservation organisations in the region, including the National Trust and the Lake District National Park, who are seeking to preserve, to protect and (in some cases) to restore the Lake District’s unique historical environmental character. With this in mind, a more detailed understanding of the landscape’s historical past is required.
This project has explored ways of collecting data on historical flora of the Lake District from historical textual source material. The United Kingdom has a very strong tradition in the accurate observation of plants extending back to at least the seventeenth century. Consequently, there are extensive records which we can draw upon, including scientific society Transactions and Proceedings, regional and national floras, scientific journals, letters, field notebooks and travel accounts. By enabling access to the empirical data contained within these sources, it is hoped that a better understanding of the region’s past biodiversity can be revealed, thus allowing for more effective conservation strategies to be developed.
However, although historical source material can provide rich insights into the historical flora of the Lake District, the material can frequently be difficult to work with. Observations on plants are often spread across multiple sources, with many also being dispersed between writings on other subjects. This can make it difficult to identify all relevant information relating to flora quickly and accurately. Furthermore, the large volume of source material required to form an overall impression of the biodiversity of specific localities makes it difficult to consult and extract information from so many sources efficiently.
This project has brought together a collaborative team including ecologists, digital humanists and computer scientists to overcome these challenges. By bringing together expertise from these three disciplines we have applied state-of-the-art digital techniques, principally drawn from Natural Language Processing (NLP), to help unlock the potential of historical sources. We have sought to do so on three fronts:
- Use of computational techniques to identify and extract information on plants from textual sources.
- Use of database tools and cloud technologies to correlate the extracted data so it can be queried and analysed and linked with other database platforms.
- Use of digital tools to visualise and communicate the complexities of historical biodiversity to environmental organisations and the public.
There is a lot of information sitting in such historical sources that is relevant to environmental scientists – who want to know more about the types of plant species present in a given locality – and to historians – who want to make sense of the history of plants, locations and even the observers. The primary types of information that we aimed to extract from historical sources in this project are as follows: plant names, observers and locations. Other pieces of information are also relevant, such as the geographical attributes of a given locality. For example, a particular plant may be recorded near a river in a given location. Extracting such details as to where the plant was spotted can be very challenging, but can also provide insightful information regarding the place where the plant was found. Figure 1 shows three types of entities in its legend – observer, plant name and location. The text is further highlighted to show a few examples of such entities that have been extracted. Figure 2 further shows the spatial relation and geographic attributes around a given location.
The extraction of information from the corpus (an existing collection of digital data about plants recorded over centuries) has been achieved using Natural Language Processing (NLP) techniques. Such extraction is possible via a Machine Learning algorithm that can process the given text and identify the appropriate entities. In the case of the corpus used for this work (the one shown in Figure 2), given that there was no pre-trained algorithm that could identify plants, observers and locations, we had to create our own.
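To make the idea of entity extraction concrete, here is a minimal sketch in Python. It is not the model built in the project, which was a trained Machine Learning algorithm. It uses plain dictionary (gazetteer) lookup as a much simpler stand-in. The gazetteer entries, the labels `PLANT` and `LOCATION`, the function name `tag_entities` and the sample sentence are all assumptions made up for illustration.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical gazetteer mapping a surface form to an entity type.
# These entries are illustrative and are not taken from the project corpus.
GAZETTEER: Dict[str, str] = {
    "Drosera rotundifolia": "PLANT",
    "Parnassia palustris": "PLANT",
    "Borrowdale": "LOCATION",
    "Derwentwater": "LOCATION",
}


def tag_entities(text: str, gazetteer: Dict[str, str]) -> List[Tuple[int, str, str]]:
    """Return (start_offset, matched_text, label) for every gazetteer hit."""
    hits = []
    for term, label in gazetteer.items():
        # Word boundaries stop a term matching inside a longer word.
        pattern = r"\b" + re.escape(term) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            hits.append((match.start(), match.group(0), label))
    # Sort by position so entities come back in reading order.
    return sorted(hits)


# An invented sentence, not a quotation from any historical source.
SAMPLE = "Drosera rotundifolia grows in bogs near Derwentwater, in Borrowdale."
ENTITIES = tag_entities(SAMPLE, GAZETTEER)

for start, matched, label in ENTITIES:
    print(f"{start:3d}  {label:8s}  {matched}")
```

A lookup like this only finds names it already knows and cannot cope with historical spelling variants or unseen observers, which is one reason a trained model is attractive for a corpus of this kind.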
Once the data has been extracted, it can be stored in a database, and this opens different perspectives on how to use the data in applications. For instance, we stored the data in a NoSQL database (MongoDB), and used a graphical tool running on MongoDB, known as a Visual Query Builder, to query the data. The data was also used for plotting in GIS software. Furthermore, we created a linked data model using semantic techniques to link the corpus data to other datasets that may be useful for further analysis. For instance, the data was linked to the following datasets: a database of plant name synonyms, a plant taxonomy database, and a geocoordinate database. Whilst the first two datasets served the purpose of enhancing the plant information, the geocoordinate database was used mostly for spatial querying. An example of a spatial query is to find plants that have been observed within a 5-mile radius of a given location. The development of this approach will allow us to explore our corpus much more effectively than the techniques we had been using previously. The project also allowed us to develop important connections to others interested in this area, and we intend to continue developing machine learning approaches to exploring the data we have collated through further funding applications.
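The 5-mile radius query mentioned above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the project's implementation: the record layout (`plant`, `lat`, `lon`), the function names and the coordinates are hypothetical, and distance is computed with the standard haversine formula on a spherical Earth.

```python
from math import asin, cos, radians, sin, sqrt
from typing import Dict, List, Union

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius, spherical approximation

Record = Dict[str, Union[str, float]]


def haversine_miles(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in miles between two points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))


def plants_within(records: List[Record], lat: float, lon: float,
                  radius_miles: float = 5.0) -> List[str]:
    """Return the sorted, de-duplicated plant names recorded within the radius."""
    return sorted({
        str(r["plant"])
        for r in records
        if haversine_miles(lat, lon, float(r["lat"]), float(r["lon"])) <= radius_miles
    })


# Hypothetical records with made-up coordinates, for illustration only.
RECORDS: List[Record] = [
    {"plant": "plant A", "lat": 54.60, "lon": -3.14},
    {"plant": "plant B", "lat": 54.62, "lon": -3.10},
    {"plant": "plant C", "lat": 54.20, "lon": -2.90},
]

NEARBY = plants_within(RECORDS, lat=54.60, lon=-3.14)
print(NEARBY)
```

In a database such as MongoDB, a comparable filter could in principle be expressed with its geospatial query operators (for example `$geoWithin` with `$centerSphere`). Whether the project used them is not stated here.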
Project team: Dr Carly Stevens, Dr Vatsala Nundloll and Dr Robert Smail; Lancaster University
The work of Ensemble and subsequent grants has been funded by the UK EPSRC as part of the Senior Fellowship in the Role of Digital Technology in Understanding, Mitigating and Adapting to Environmental Change grant no: EP/P002285/1.
We would also like to acknowledge Henry Ford, who worked as a plant biologist, and who provided us with data on plant species. We are very grateful to him for having willingly given us this data, which he spent a long time manually entering into a database.