A Little Bit spaCy

John Hurley
Published in The Startup
6 min read · Jul 22, 2020

Named Entity Recognition for taxonomy with the spaCy NLP library

Photo by Jan Antonin Kolar on Unsplash

The Audubon Society’s Christmas Bird Count has been a tradition for 120 years. Started by ornithologist Frank M. Chapman in 1900, it was intended to encourage conservation by counting birds rather than hunting them. One thing that hasn’t changed much since then is that it involves a lot of manual paperwork.

As part of a larger effort to automate the annual Christmas Bird Count (CBC), I wanted to be able to create a definitive species list for a count circle given a PDF of a tally sheet, similar to this:

This article describes how I used the spaCy Natural Language Processing (NLP) library for Python to automate this step of the process. I think it is of general interest: I had found many “hand-wavy” articles about using spaCy to tag text from a known list of technical terms, but on inspection they ignored critical real-world issues.

Description of the Problem

There are a vast number of CBC circles, and each uses a wide range of technology. Of course, if there is an existing spreadsheet or text document, extracting and sorting the species is greatly simplified. In general, the tally sheets available through count circle websites are in PDF format. Some of them, such as the Calero-Morgan Hill checklist above, have clean lines and were likely printed from a spreadsheet. Others are more detailed, like this:

A section of the NYRC 2009 foldable checklist

The checklists are generally in taxonomic order and grouped by family, such as “Grouse, Quail, and Allies”. There are often typographic indicators, such as the bold text above, which indicate that a species is rare; in other checklists this may be included in the text, e.g. “Violet-green Swallow (rare)”.

When extracting text from a PDF, however, what you see is most definitely not what you get, at least not in the same order as it appears visually. Combine that with abbreviations, misspellings and taxon changes, and it becomes surprisingly tricky to come up with a list that contains only the species on the checklist and nothing else. I will cover the extraction of text from a PDF using pdfminer in a separate article.

Using spaCy

spaCy is a Python library for Natural Language Processing that excels in tokenization, named entity recognition, sentence segmentation and visualization, among other things.

Prior to using spaCy, I had tried a mix of NLTK, scikit-learn and old-fashioned regex to tackle this problem, and I think that I may still need some of these techniques going forward to handle some of the interesting ways that species are presented on the checklists.

The basic flow of the code is this:

  • Instantiate a blank English pipeline (nlp = English()), which provides the tokenizer but no trained components
  • For each name in the taxonomy, build a pattern from its tokens and its category/tag, e.g. CommonName or FamilyScientific
  • Add these patterns to an instance of EntityRuler
  • Add the EntityRuler to the nlp pipeline
  • Process the text

The core of the code is below, with some details omitted for clarity. Refer to the code on GitHub for a full working version.

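from spacy.lang.en import English

def spacify_text(text, taxonomy):
    # English() is a blank pipeline: tokenizer only, no trained components
    nlp = English()
    # quote_merger2 is a custom pipeline component (not shown here)
    nlp.add_pipe(quote_merger2, first=True)
    nlp.tokenizer.rules = {key: value for key, ... #omitted
    # Build the EntityRuler from the taxonomy, or load it from disk
    ruler = get_entity_ruler_cached(nlp, taxonomy)
    nlp.add_pipe(ruler)
    # Process the lowercased text
    doc = nlp(text.lower())
    return doc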

The function get_entity_ruler_cached creates the patterns and saves the EntityRuler to disk, or loads it from disk if a saved copy exists. Calling spacify_text returns a “doc” object whose entities we can enumerate to extract the list for each tag.
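As a rough sketch of what a helper like get_entity_ruler_cached might do (this is not the code from the repository), the block below builds token patterns with the pipeline's own tokenizer and caches them as a JSONL file. It assumes spaCy v3, where add_pipe takes the component's string name; the snippet above uses the older form that passes the component object. The names build_patterns and load_or_build_ruler, the tiny taxonomy and the cache file name are all made up for illustration.

```python
import tempfile
from pathlib import Path

from spacy.lang.en import English


def build_patterns(nlp, taxonomy):
    """Turn {label: [names]} into EntityRuler token patterns."""
    patterns = []
    for label, names in taxonomy.items():
        for name in names:
            # Tokenize with the pipeline's own tokenizer so the patterns
            # are split the same way the checklist text will be split.
            tokens = [{"LOWER": tok.text} for tok in nlp.tokenizer(name.lower())]
            patterns.append({"label": label, "pattern": tokens})
    return patterns


def load_or_build_ruler(nlp, taxonomy, cache_path):
    """Add an EntityRuler to nlp, reusing cached patterns when present."""
    ruler = nlp.add_pipe("entity_ruler")  # assumption: spaCy v3 string-name form
    if cache_path.exists():
        ruler.from_disk(cache_path)
    else:
        ruler.add_patterns(build_patterns(nlp, taxonomy))
        ruler.to_disk(cache_path)
    return ruler


# Hypothetical three-entry taxonomy, for illustration only.
taxonomy = {
    "COMMONNAME": ["Western Grebe", "Clark's Grebe"],
    "FAMILYCOMMON": ["Grebes"],
}

with tempfile.TemporaryDirectory() as tmp:
    nlp = English()
    load_or_build_ruler(nlp, taxonomy, Path(tmp) / "patterns.jsonl")
    doc = nlp("Western Grebe 12 Clark's Grebe 3 GREBES".lower())
    found = [(ent.text, ent.label_) for ent in doc.ents]

print(found)
```

Because the patterns and the text go through the same tokenizer, a name with a possessive such as “Clark's Grebe” is split identically on both sides.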

One of the key advantages of spaCy is its visualization component, displaCy. All of the code needed to either show the visualization in a Jupyter notebook or generate HTML is shown here (I omitted the color definitions so you can come to your own idea of what fuchsiaish looks like, or check the repo):

from spacy import displacy

def create_visualization(docx, show_in_jupyter=True):
    # Map each entity label to a display color (color definitions omitted)
    colors = {
        "COMMONNAME": purplish,
        "SCIENTIFICNAME": aquaish,
        "ORDER": greenish,
        "FAMILYCOMMON": yellowish,
        "FAMILYSCIENTIFIC": fuchsiaish,
    }
    # Only render the taxonomy labels
    options = {"ents": ["COMMONNAME", "SCIENTIFICNAME", "ORDER",
                        "FAMILYCOMMON", "FAMILYSCIENTIFIC"],
               "colors": colors}
    html = displacy.render([docx], style="ent", page=True,
                           jupyter=show_in_jupyter, options=options)
    return html

The output when run on the article The 119th Christmas Bird Count Summary is shown below:

Partial results from code run on the Audubon 2019 CBC Summary

There are a couple of interesting things to note here. First, we want to discard anything in our checklist that is not tagged “CommonName” (i.e. a species), because only a species can be observed in the field. This means discarding text like GREBES/PELICANS/CORMORANTS or GEESE/DUCKS, as seen in the CACR checklist at the start of the article. I had previously done this manually with code that used the inflect engine to detect plurals.
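A minimal sketch of that filtering step, assuming the species label is the string "COMMONNAME" used in the visualization code. The helper name species_only is hypothetical, and the Doc is built by hand here only so that the example runs without a taxonomy; in practice it would come from the pipeline.

```python
from spacy.tokens import Doc, Span
from spacy.vocab import Vocab


def species_only(doc, label="COMMONNAME"):
    """Return the distinct entity texts carrying the species label, in order."""
    seen = []
    for ent in doc.ents:
        if ent.label_ == label and ent.text not in seen:
            seen.append(ent.text)
    return seen


# Hand-built stand-in for a processed checklist (illustration only).
words = ["grebes", "western", "grebe", "geese", "/", "ducks", "western", "grebe"]
doc = Doc(Vocab(), words=words)
doc.ents = [
    Span(doc, 0, 1, label="FAMILYCOMMON"),  # "grebes" is a family heading
    Span(doc, 1, 3, label="COMMONNAME"),    # "western grebe" is a species
    Span(doc, 6, 8, label="COMMONNAME"),    # repeated species, kept once
]

species = species_only(doc)
print(species)
```

Family headings such as “grebes” are dropped, and a species that appears twice is listed once.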

Second, spaCy did the right thing with “black-browed albatrosses” (Thalassarche melanophris): it did not tag the whole phrase as a species, but recognized “albatrosses” as a family common name.

Even though we are ultimately discarding these, being able to label most of the items in the list makes the unidentified ones stand out. A section of the output from the Calero-Morgan Hill run is here:

Snippet of displaCy output

We have three unidentified phrases here. We can immediately discard “party hours…” as a non-species phrase. At the bottom is “accipiter, sp”, which is not in the taxonomy, but with a small transformation we get the correct (and recognized) “accipiter sp.”. Lastly, as of 2014 “Nutmeg Mannikin” is known as “Scaly-breasted Munia”. To solve this particular problem in general, a list of taxon changes could be applied prior to named entity recognition.
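One way such a cleanup pass could look, as a sketch rather than the repository's code: a regex that rewrites “name, sp” as “name sp.”, followed by a rename table applied to the lowercased text before it reaches spaCy. The function name preprocess and the TAXON_CHANGES table are hypothetical; only the Nutmeg Mannikin entry comes from the example above.

```python
import re

# Hypothetical rename table; a real one would come from a list of taxon changes.
TAXON_CHANGES = {"nutmeg mannikin": "scaly-breasted munia"}


def preprocess(text, changes=TAXON_CHANGES):
    """Lowercase the text, normalize 'sp' abbreviations and apply renames."""
    text = text.lower()
    # "accipiter, sp" or "accipiter sp" becomes "accipiter sp."
    text = re.sub(r"\b(\w+),?[ \t]+sp\b\.?", r"\1 sp.", text)
    # Replace superseded names with their current equivalents.
    for old, new in changes.items():
        text = re.sub(r"\b" + re.escape(old) + r"\b", new, text)
    return text


cleaned = preprocess("Accipiter, sp 2\nNutmeg Mannikin 5")
print(cleaned)
```

Keeping the renames in a plain dictionary means new taxon changes can be added without touching the matching code.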

Issues

I have made this all seem as rosy as, well, a Rosy-faced Lovebird, but there were a few issues that I had to sort out to make it work for this problem. The main issue was tokenization of species names with slashes or possessives. The poster child for this is:

Western/Clark's Grebe

A CBC counter would enter this on their checklist if it was not possible to tell whether the bird was a Western Grebe or a Clark’s Grebe. I spent a fair amount of time messing around with nlp.tokenizer.rules and doc.retokenize, thinking that the possessive (i.e. “ ’s ”) was the issue. The real problem was that I had used NLTK’s wordpunct_tokenize to create the EntityRuler patterns from the taxonomy, but nlp.tokenizer for the text. Using nlp.tokenizer consistently fixed my issue with recognizing our sometimes indistinguishable Aechmophorus sp., at least in a checklist if not in the field.
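The mismatch is easy to see by tokenizing the same name both ways. In this sketch the regex stands in for NLTK's wordpunct_tokenize (it is the pattern that NLTK's WordPunctTokenizer is built on), so NLTK itself is not needed, and the spaCy side uses the default English tokenizer rather than the customized rules from the repository. The helper name wordpunct_like is made up for illustration.

```python
import re

from spacy.lang.en import English


def wordpunct_like(text):
    """Split into runs of word characters and runs of punctuation."""
    return re.findall(r"\w+|[^\w\s]+", text)


nlp = English()
name = "western/clark's grebe"

punct_tokens = wordpunct_like(name)
spacy_tokens = [tok.text for tok in nlp.tokenizer(name)]

print(punct_tokens)
print(spacy_tokens)
# A pattern built from one token list cannot match text split the other way.
print("tokenizers agree:", punct_tokens == spacy_tokens)
```

The regex splits the possessive into an apostrophe and a separate “s”, so a token pattern built from it has more tokens than spaCy produces for the same text and never lines up.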

Conclusion

I found spaCy to be a very capable and well-documented library. As always, handling user-generated data can get messy. A fully working code project can be found in my GitHub repository.

John Hurley
Mathematician, data scientist, equestrian, photographer, birder. I enjoy looking for patterns. https://www.linkedin.com/in/johnhurleyphd/