Geographically Uneven Coverage of Wikipedia

This map shows the highly uneven spatial distribution of geotagged Wikipedia articles across 44 language versions of the encyclopaedia. Slightly more than half of the global total of 3,336,473 articles are about places, events and people inside the red circle on the map, which covers only about 2.5% of the world’s land area.

Data

The map is based on Wikipedia data dumps from November 2012 covering 44 languages. We excluded articles with more than four geotags, which typically consist of lists of geographic features. For each remaining article, we chose the most frequent geotag. If all geotags occurred only once, the first geotag (typically the most important one) was chosen as representative for the article. Additionally, we gathered article metrics such as the number of characters and words in the article, the number of links to other Wikipedia articles, the number of external links and the number of in-article references. We mapped the article locations on top of a dataset obtained from Natural Earth, using Buckminster Fuller’s Dymaxion map projection, which has little distortion of shape and area and highlights that there is no ‘right way up’.
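
The geotag selection rule described above can be sketched roughly as follows. This is a minimal illustration, not the code used to produce the map: the function name `representative_geotag`, the constant `MAX_GEOTAGS`, and the representation of a geotag as a `(latitude, longitude)` tuple are all assumptions made for the example. It also assumes that the four-geotag limit counts every geotag in the article, including repeats, and that a tie between equally frequent geotags goes to the one that appears first; the post does not specify either point.

```python
from collections import Counter
from typing import Optional, Sequence, Tuple

Geotag = Tuple[float, float]  # (latitude, longitude); assumed representation

MAX_GEOTAGS = 4  # articles with more geotags than this are treated as lists


def representative_geotag(geotags: Sequence[Geotag]) -> Optional[Geotag]:
    """Return one geotag for an article, or None if the article is excluded."""
    if not geotags or len(geotags) > MAX_GEOTAGS:
        return None
    counts = Counter(geotags)
    # Counter keeps first-seen order and max() returns the first maximum,
    # so if every geotag occurs once, the first geotag is chosen.
    # Ties between repeated geotags also go to the earliest one (assumption).
    return max(counts, key=counts.get)


# The repeated geotag wins over the one listed first.
print(representative_geotag([(1.0, 2.0), (3.0, 4.0), (3.0, 4.0)]))
# All geotags occur once, so the first one is used.
print(representative_geotag([(1.0, 2.0), (3.0, 4.0)]))
```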

Findings

The map highlights the fact that a majority of content produced in Wikipedia is about a relatively small part of our planet. This finding supports previous work on the geographical biases of Wikipedia. Consider, for example, this visualization of the state of Wikipedia in 2010. We know that different language versions have varying shares of geocoded articles. English, Polish, German, Dutch and French are the Wikipedias with the largest numbers of geotagged articles. Since all of these languages are spoken in Europe, they may contribute significantly to the dominant position of this continent in the above map.

By contrast, other continents are much less represented in the world’s most prominent digital repository of human knowledge. As we pointed out in the post about Africa on Wikipedia, the whole continent of Africa contains only about 2.6% of the world’s geotagged Wikipedia articles despite having 14% of the world’s population and 20% of the world’s land.
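
One way to put these shares side by side is a simple ratio of article share to population or land share, where 1.0 would mean proportional coverage. The helper name `representation_ratio` is made up for this sketch; the percentages are the ones quoted above.

```python
def representation_ratio(article_share: float, reference_share: float) -> float:
    """Ratio of a region's share of articles to its share of a reference quantity."""
    if reference_share <= 0:
        raise ValueError("reference_share must be positive")
    return article_share / reference_share


africa_articles = 2.6     # % of the world's geotagged Wikipedia articles
africa_population = 14.0  # % of the world's population
africa_land = 20.0        # % of the world's land

print(round(representation_ratio(africa_articles, africa_population), 2))  # 0.19
print(round(representation_ratio(africa_articles, africa_land), 2))        # 0.13
```

On these figures, Africa has roughly a fifth of the articles its population share would suggest, and about an eighth of what its land share would suggest.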

Further exploring the two groups represented in the map above (the inside and the outside of the red circle), we find that Wikipedia articles inside the circle have had a head start: they are on average a bit older than those outside. Especially in 2005 and 2006, editing activity about this European area picked up much faster than in the rest of the world.

Analysing the number of words per article shows that the articles inside the circle are a bit shorter than those outside. The average word counts per article are 419 inside the circle (median: 215) and 455 for the rest of the world (median: 260). To put this into perspective: these average word counts equate to the number of words you have read in this blog post up until this point.
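
In both groups the mean is well above the median, which is the pattern one sees when a minority of very long articles pulls the average up. The sketch below illustrates this effect with made-up word counts (they are not data from the study), using Python's standard `statistics` module.

```python
from statistics import mean, median

# Hypothetical word counts for illustration only:
# many short articles and a single very long one.
word_counts = [120, 150, 180, 215, 240, 300, 1800]

avg = mean(word_counts)
mid = median(word_counts)

print(f"mean:   {avg:.0f}")
print(f"median: {mid}")
print("mean exceeds median:", avg > mid)
```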

While it is possible that this difference in word counts translates to differences in quality, one has to bear in mind that there may be other factors at play, such as variation in style and linguistic density or verbosity of the relevant languages in the respective areas.

However, at least within the English Wikipedia, we could show that in general word counts in large parts of Europe are indeed lower than those in North America. The comparison within Europe shows that articles about places, events and people in, for example, Italy and Great Britain are noticeably longer than those about such topics in France or Poland.

To obtain a clearer picture, we can analyse the distribution of languages both inside and outside the circle. In doing so, we can clearly see that most major European languages, with the exception of English (EN), Russian (RU) and Spanish (ES), have more articles inside the circle than outside.
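
A per-language tally of this kind could be sketched as below. The post does not state the centre or radius of the red circle, so `CIRCLE_CENTER` and `CIRCLE_RADIUS_KM` are placeholder values chosen purely for illustration, and the function names are hypothetical. The sketch also simplifies by treating the circle as a great-circle disc on a spherical Earth, which may differ from a circle drawn on the Dymaxion projection.

```python
import math
from collections import Counter
from typing import Iterable, Tuple

Geotag = Tuple[float, float]  # (latitude, longitude) in degrees

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

# Placeholder circle: NOT taken from the post, assumed for illustration.
CIRCLE_CENTER: Geotag = (48.0, 10.0)
CIRCLE_RADIUS_KM = 1500.0


def haversine_km(a: Geotag, b: Geotag) -> float:
    """Great-circle distance between two points using the haversine formula."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def tally_by_language(articles: Iterable[Tuple[str, Geotag]]) -> Tuple[Counter, Counter]:
    """Count articles per language code inside and outside the circle."""
    inside, outside = Counter(), Counter()
    for lang, geotag in articles:
        if haversine_km(geotag, CIRCLE_CENTER) <= CIRCLE_RADIUS_KM:
            inside[lang] += 1
        else:
            outside[lang] += 1
    return inside, outside


# Toy input: one article each, geotagged at Berlin, New York and Moscow.
sample = [
    ("de", (52.52, 13.40)),
    ("en", (40.71, -74.01)),
    ("ru", (55.75, 37.62)),
]
inside, outside = tally_by_language(sample)
print("inside: ", dict(inside))
print("outside:", dict(outside))
```

Comparing `inside[lang]` with `outside[lang]` for each language code then gives the kind of split described above.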

March 22nd, 2018