Toponym Extraction Case Study:
Using named entity recognition to extract geographical information from newspapers
Author: Lawrence Vriend
This case study illustrates the kind of analysis that becomes possible when named entity recognition (NER) is used to annotate texts with geographical entities. The code and scripts used in this analysis can be found in the GitHub repo.
“The press, as we now know it, has grown and evolved rapidly over the last 200 years (particularly since 1850), generating an ever-growing flood of geographical information.” (Harvey, 2005, p.230)
Newspapers inform us about what is happening in the world. In this sense they convey geographical information. One tool at our disposal for reasoning about the world is the toponym, i.e. the place name. This raises the question of how often newspapers refer to places through toponyms.
Let’s try to find out by analyzing Dutch newspaper coverage of Brexit.
To start, we select all articles published in 2017 in four Dutch newspapers that mention the word ‘Brexit’ at least once:
In total, 1,830 articles met these criteria. The articles used in this case study can be found here. The content itself is copyrighted, so unfortunately the dataset contains only metadata and not the articles themselves. Check out these word clouds, though, to see which lemmata are most prevalent in each newspaper. Brexit was in the news throughout the year in all four newspapers:
Now we need to get hold of a list of toponyms to use in the named entity recognition step. Two freely available resources that we can use for this purpose are the GeoNames and REST Countries datasets. The map below shows all data points that were selected from the GeoNames dataset for the task at hand. The dataset contains 45,300 toponyms in total.
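As a rough illustration of how such a toponym list could be assembled, the sketch below reads rows in the tab-separated layout of the GeoNames dumps and keeps populated places for selected countries. It is not the code from the repo: the function name `load_toponyms`, the population threshold, and the sample rows (made-up identifiers, rounded coordinates, placeholder populations) are assumptions for illustration only.

```python
import csv
import io

# Column order of the GeoNames tab-separated "geoname" dump (no header row).
GEONAMES_COLUMNS = [
    "geonameid", "name", "asciiname", "alternatenames", "latitude",
    "longitude", "feature_class", "feature_code", "country_code", "cc2",
    "admin1_code", "admin2_code", "admin3_code", "admin4_code",
    "population", "elevation", "dem", "timezone", "modification_date",
]


def load_toponyms(handle, country_codes=None, min_population=0):
    """Return {name: (lat, lon)} for populated places passing the filters."""
    reader = csv.DictReader(handle, fieldnames=GEONAMES_COLUMNS,
                            delimiter="\t", quoting=csv.QUOTE_NONE)
    toponyms = {}
    for row in reader:
        if row["feature_class"] != "P":  # "P" marks cities, towns, villages
            continue
        if country_codes and row["country_code"] not in country_codes:
            continue
        if int(row["population"] or 0) < min_population:
            continue
        toponyms[row["name"]] = (float(row["latitude"]), float(row["longitude"]))
    return toponyms


def sample_row(**values):
    """Build one tab-separated line, leaving unspecified columns empty."""
    fields = dict.fromkeys(GEONAMES_COLUMNS, "")
    fields.update(values)
    return "\t".join(fields[column] for column in GEONAMES_COLUMNS)


# Hypothetical sample rows: identifiers and populations are placeholders.
SAMPLE = "\n".join([
    sample_row(geonameid="1", name="London", latitude="51.51", longitude="-0.13",
               feature_class="P", country_code="GB", population="9000000"),
    sample_row(geonameid="2", name="Amsterdam", latitude="52.37", longitude="4.89",
               feature_class="P", country_code="NL", population="800000"),
    sample_row(geonameid="3", name="Leeuwarden", latitude="53.20", longitude="5.80",
               feature_class="P", country_code="NL", population="90000"),
    sample_row(geonameid="4", name="Paris", latitude="48.85", longitude="2.35",
               feature_class="P", country_code="FR", population="2000000"),
    sample_row(geonameid="5", name="North Sea", latitude="56.00", longitude="3.00",
               feature_class="H", country_code="", population=""),
])

toponyms = load_toponyms(io.StringIO(SAMPLE), country_codes={"GB", "NL"},
                         min_population=100000)
print(sorted(toponyms))  # ['Amsterdam', 'London']
```

Keying the result by name is a simplification: places that share a name collapse into one entry, so a real selection would probably keep the GeoNames identifier as well.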
Using spaCy's named entity recognition functionality, we can then collect and count all toponyms in the newspaper articles. Perhaps unsurprisingly, only a fraction of the toponyms in the lists are recognized:
Toponym list | Entries in REST Countries/GeoNames | Found in articles | Found (%) |
---|---|---|---|
Countries | 268 | 151 | 56.3 |
Cities Friesland | 27 | 10 | 37.0 |
Cities NL | 474 | 94 | 19.8 |
Cities UK | 1,532 | 102 | 6.6 |
Cities world | 43,237 | 240 | 0.5 |
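To make the counting step and the percentages in the table concrete, here is a minimal sketch. It deliberately swaps spaCy for a plain regular-expression lookup against the toponym list, so it shows only the bookkeeping (mentions, articles per toponym, share of the list found) and not statistical NER. The function names and the two example sentences are invented for illustration; the 268 and 151 in the last line are the country figures from the table.

```python
import re
from collections import Counter


def build_matcher(toponyms):
    """Compile one regex that matches any toponym as a whole word or phrase."""
    # Longest names first, so "New York" is preferred over "York".
    ordered = sorted(toponyms, key=len, reverse=True)
    pattern = "|".join(re.escape(name) for name in ordered)
    return re.compile(r"(?<!\w)(?:" + pattern + r")(?!\w)")


def count_toponyms(articles, toponyms):
    """Return (mentions, article_counts), two Counters keyed by toponym."""
    matcher = build_matcher(toponyms)
    mentions, article_counts = Counter(), Counter()
    for text in articles:
        found = matcher.findall(text)
        mentions.update(found)             # every occurrence counts
        article_counts.update(set(found))  # at most once per article
    return mentions, article_counts


def coverage(list_size, found):
    """Percentage of toponym-list entries seen at least once."""
    return 100.0 * found / list_size


# Hypothetical toponym list and article texts, for illustration only.
toponyms = ["Londen", "Brussel", "Leeuwarden", "York", "New York"]
articles = [
    "De Brexit-onderhandelingen tussen Londen en Brussel lopen vast. "
    "Londen wil meer tijd.",
    "Bedrijven verhuizen van Londen naar New York.",
]

mentions, article_counts = count_toponyms(articles, toponyms)
print(mentions["Londen"], article_counts["Londen"])  # 3 2
print(coverage(len(toponyms), len(mentions)))        # 60.0
print(round(coverage(268, 151), 1))                  # 56.3
```

The lookup is case-sensitive and cannot tell a place from a homonym (a surname, for instance), which is one reason to prefer a proper NER pipeline for the real analysis.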
The interactive graphs below show several measures for the results. Click on the legend labels to zoom in on a specific newspaper. Select multiple newspapers by shift-clicking.
The analysis also tells us which places were mentioned in the articles. By plotting that information on a map we can see the geographical distribution of news on Brexit. Explore the results in this interactive map. The circle sizes represent the number of times a toponym was found in the dataset. With the layer control you can switch layers on and off. Click on a circle to open a tooltip with more information on the toponym, including how often it occurs in the text and in how many articles.
Just by looking at the map it becomes clear that some toponyms occur much more often than others. Check this graph to find out more about these distributions.
Some insights that can be gleaned from this analysis are:
Harvey, D. (2005). “The sociological and geographical imaginations”. In: International Journal of Politics, Culture, and Society 18.3-4, p. 211–255.