Interning with Ensemble
Last summer I worked alongside Ensemble developing a solution to extract information from botany journals using NLP (Natural Language Processing). This was an incredible opportunity to develop my own coding and research skills whilst solving real-world problems and getting a taste of the workplace.
What is Natural Language Processing?
NLP (Natural Language Processing) is a technology in which a computer derives meaning from words and processes the information it finds. I have been applying this technique to extract information from historic and handwritten botany journals. This reduces the human error involved in deriving information from old journals and also removes the need for someone to work their way through each journal by hand.
The challenge with Botany Journals
Botany journals are quite complex and applying NLP to them is not a straightforward process. They have all sorts of complicated syntax and do not obey normal linguistic ‘rules’. For example, ‘L.’ has no meaning within the context of this blog, yet in the corpus we used it could be part of a plant species, shorthand for “Lakeland”, part of someone’s name or simply a typo. Overcoming challenges like this is difficult, particularly if the human (me) trying to solve it is completely inept when it comes to biology or Lake District geography.
“You shall know a word by the company it keeps”
The problem with texts derived from digitising handwriting is that they are also full of typos. Without a full dictionary of every word used, these can be very daunting to correct, either with an auto-correct, which can only guess at the best match, or by hand, which over an entire corpus can take days or even weeks. This is where John Rupert Firth’s quote “You shall know a word by the company it keeps” comes in. Humans ascertain the correct spelling of a word by understanding the meaning of the surrounding sentence. If I were to write about the building of a “hoose” you might naturally assume I’ve simply misspelt the word “house”, unless I’m writing about the ancient Greeks and the city of Troy, in which case “horse” might be a more reasonable guess. The problem is that many words have radically different meanings despite quite close spellings. Nearly all available spellcheckers today have no effective way of applying this context.
Teaching computers to think like humans
The trick is, of course, to get a computer to process words in the same way a human might. We can use a process called ‘vectorisation’, which allows word meaning to be converted to a high-dimensional vector. High-dimensional sounds quite terrifying, but it is quite simple: words are clumped together by their closeness in meaning. “Stupid” does not often come in the same sentence as “brilliant”, so they’d be on opposite sides. Words like “coffee”, “cappuccino” and “latte”, on the other hand, have very similar meanings and so would be near each other, as they are all largely interchangeable within a sentence. However, the need for dimensions (and lots of them) comes when you try to place our group of words around “coffee” in relation to a group of words like “stupid”. Whilst I am a huge advocate that “coffee” should be nearer in meaning to words like “brilliant” or “miracle”, in a sentence “coffee” is not interchangeable with them: it has nothing really to do with either. Therefore, our model starts to acquire new dimensions to accommodate new meanings.
With just two dimensions, things could look quite simple:
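A toy sketch of that two-dimensional picture in Python. The coordinates are hand-picked purely for illustration (they come from no trained model), and cosine similarity is assumed here as the measure of "closeness"; a real model would learn hundreds of dimensions from the corpus.

```python
import math

# Hand-picked 2-D coordinates, purely illustrative (not from a trained model).
toy_vectors = {
    "coffee": (1.0, 0.0),
    "cappuccino": (0.9, 0.1),
    "latte": (0.9, -0.1),
    "brilliant": (0.0, 1.0),
    "stupid": (0.0, -1.0),
}


def cosine_similarity(a, b):
    """1.0 = same direction, 0.0 = unrelated, -1.0 = opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(word, vectors):
    """Return the other word whose vector points most nearly the same way."""
    others = [w for w in vectors if w != word]
    return max(others, key=lambda w: cosine_similarity(vectors[word], vectors[w]))


for other in ("cappuccino", "brilliant", "stupid"):
    score = cosine_similarity(toy_vectors["coffee"], toy_vectors[other])
    print(f"coffee vs {other}: {score:.2f}")
```

In this toy layout the coffee words sit close together, "stupid" and "brilliant" point in opposite directions, and "coffee" is unrelated to both of them.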
Given that a word is placed in our model by the words that appear around it, we can reverse this to find words based on “nearby” words, either within a sentence or by meaning. We can simply use the words we find nearby to approximate the vector of the word we want. At that point we must assume that the misspelt word is less common across the entire corpus than its correct spelling, and pick the word nearest to that vector which is both more common and close in spelling.
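One way to sketch that reversal, using the earlier “hoose” example: average the vectors of the surrounding words and look for the nearest known word. The vectors are invented for illustration, and both the simple averaging and the function names `context_vector` and `nearest_word` are assumptions made for this sketch rather than a description of the model we actually built.

```python
import math

# Invented 2-D vectors for illustration only.
vectors = {
    "building": (1.0, 0.1),
    "bricks": (0.9, 0.0),
    "roof": (0.95, 0.2),
    "house": (0.95, 0.1),
    "troy": (0.1, 1.0),
    "greeks": (0.0, 0.9),
    "horse": (0.1, 0.95),
}


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def context_vector(context, vectors):
    """Approximate the unknown word's vector as the mean of its known neighbours."""
    known = [vectors[w] for w in context if w in vectors]
    if not known:
        raise ValueError("no context word has a vector")
    return tuple(sum(axis) / len(known) for axis in zip(*known))


def nearest_word(target, vectors, exclude=()):
    """Return the word (outside `exclude`) whose vector is closest to `target`."""
    candidates = [w for w in vectors if w not in exclude]
    return max(candidates, key=lambda w: cosine_similarity(target, vectors[w]))


# "hoose" is unknown, so we guess its meaning from the company it keeps.
for context in (["building", "bricks", "roof"], ["troy", "greeks"]):
    guess = nearest_word(context_vector(context, vectors), vectors, exclude=context)
    print(context, "->", guess)
```

The same misspelling resolves differently depending on the sentence around it, which is exactly the behaviour a dictionary-based spellchecker cannot offer.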
What we have created is a spell checker that does not need a dictionary, only a big enough sample of a given context; and it does not even need all the words to be correctly spelt! That’s pretty cool! As an example, suppose I were to say this blog was initially typed using “Micrsoft word” (despite how difficult it is to wilfully defy spellcheck). With any luck we can detect typos as words that appear infrequently and contain odd combinations of letters (like “crs”). We can then look at where the other words in the sentence appear in the graph: “typed”, “spellcheck” and “word” might all be reasonably near to “Microsoft” within our model. Then it is simply a case of finding the nearest word by meaning that has the most letters in common.
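The final selection step might look something like the sketch below. It assumes the list of words "nearby in meaning" has already been found from the vector model, and it uses `difflib.SequenceMatcher` from the Python standard library as a stand-in for "most letters in common". The helper names, the tiny corpus and the 0.7 threshold are all hypothetical choices for illustration.

```python
from collections import Counter
from difflib import SequenceMatcher


def rare_words(tokens, max_count=1):
    """Words seen at most `max_count` times: the suspects for being typos."""
    counts = Counter(tokens)
    return {word for word, count in counts.items() if count <= max_count}


def best_correction(typo, neighbours, counts, min_ratio=0.7):
    """Pick the neighbour that is more common than the typo and spelt most like it.

    `neighbours` are words assumed to be close in meaning to the typo's context.
    Returns None when no neighbour is a plausible correction.
    """
    best_word, best_key = None, None
    for candidate in neighbours:
        if counts[candidate] <= counts[typo]:
            continue  # a correction should be more common than the typo
        ratio = SequenceMatcher(None, typo, candidate).ratio()
        if ratio < min_ratio:
            continue  # spelling too different to be the same word
        key = (ratio, counts[candidate])  # prefer close spelling, then frequency
        if best_key is None or key > best_key:
            best_word, best_key = candidate, key
    return best_word


# A tiny made-up corpus, lower-cased and split on whitespace.
tokens = "microsoft word microsoft office micrsoft word typed spellcheck microsoft".split()
counts = Counter(tokens)
neighbours = ["microsoft", "word", "office", "spellcheck"]

print(sorted(rare_words(tokens)))
print(best_correction("micrsoft", neighbours, counts))
```

Note that `rare_words` will also flag perfectly good words that simply appear once, so rarity only nominates suspects; it is the combination of meaning, frequency and spelling that decides whether a correction is made.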
This also solves the earlier problem of not knowing whether a single letter like “L.” is a name, plant, place or something else. We can look up the vector of the surrounding sentence and work out what it is indicating. The beauty of this solution is that it moves towards a more universal approach to NLP. Currently available software, such as Prodigy, labels entities or words with tags which a computer can then guess at. It doesn’t really help the computer understand things, but rather just assigns more human-readable tags to bits of text. When using a high-dimensional search space for all the words in a corpus, by contrast, distinctions become more abstract: deciding whether something is a person, a place or a plant can be done by measuring how close they are in our model.
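As a rough sketch of how that measurement could work for “L.”: compare the averaged vector of the surrounding words with a centre point for each category. The three-dimensional vectors, the seed words per category and the `classify` helper are all invented for this example, and real categories would not line up so neatly with individual axes.

```python
import math

# Invented 3-D vectors; axes loosely mean (plant, place, person).
vectors = {
    "species": (1.0, 0.0, 0.1), "flower": (0.9, 0.1, 0.0),
    "village": (0.0, 1.0, 0.1), "valley": (0.1, 0.9, 0.0),
    "mr": (0.0, 0.1, 1.0), "collector": (0.2, 0.0, 0.9),
    "petals": (0.95, 0.05, 0.0), "leaves": (0.9, 0.0, 0.05),
    "walked": (0.1, 0.8, 0.3), "near": (0.1, 0.9, 0.1),
}

# Hypothetical seed words that define each category.
seeds = {
    "plant": ["species", "flower"],
    "place": ["village", "valley"],
    "person": ["mr", "collector"],
}


def mean_vector(words):
    """Average the vectors of the given words, axis by axis."""
    vecs = [vectors[w] for w in words]
    return tuple(sum(axis) / len(vecs) for axis in zip(*vecs))


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def classify(context):
    """Return the category whose centre is closest to the context's mean vector."""
    target = mean_vector(context)
    return max(seeds, key=lambda cat: cosine_similarity(target, mean_vector(seeds[cat])))


print(classify(["petals", "leaves"]))  # words found around one "L."
print(classify(["walked", "near"]))    # words found around another "L."
```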
With the advent of computer vision, as we label physical objects, a computer might one day be able to describe an object by using these vector maps to work out distinguishing properties. Perhaps more excitingly, this approach is not language-specific. Theoretically, the same vectorisation and spellcheck techniques would work on a text in German, Latin, Russian or even a language no one has ever heard of.
I hope this brief snapshot of what we achieved over the summer highlights the exciting opportunity offered by internships. I still enjoy the benefits of these insights and of the hands-on experience of applying them.
Author: Stephen Mander
Image used in header: Image from page 334 of “The American botanist : a monthly journal for the plant lover” (1901-1948), by Willard Nelson Clute
Thumbnail image: Image from page 365 of “The American botanist : a monthly journal for the plant lover” (1901-1948), by Willard Nelson Clute