Geographic intersections of languages in Wikipedia
This graph illustrates the percentage of geo-referenced articles in the twenty editions of Wikipedia containing the largest number of geo-referenced articles.


The Terra Incognita project by Tracemedia investigates how Wikipedia has evolved over the last decade, mapping geographic articles and their dates of creation for over 50 languages. The maps highlight geolinguistic biases, unexpected areas of focus, and overlaps between the spatial coverage of different languages.

The project was developed using geo-coded Wikipedia articles from the Wikimedia Toolserver Ghel project (Geohack External Links), and article metrics that were collated using Toolserver scripts. The Ghel data dumps date to July 2013.

Only articles with primary coordinates are used, that is “where the location should be considered the primary object(s) in the page […]. Generally this should be one per article, but may be more with current corner cases with source and outlet of lakes and rivers” (Ghel project).

As illustrated in the featured graphic above (see the bar chart by the Terra Incognita project), the percentage of geocoded articles varies widely across Wikipedia editions, from a minimum of 2% (Hindi Wikipedia) to a maximum of 46% (Polish Wikipedia), with the exception of the constructed language Volapük, whose Wikipedia edition consists of 79% geocoded articles. Most large editions in Germanic and Italic languages contain between 12% (Italian Wikipedia) and 20% (English Wikipedia) geo-coded articles.


The primary goal of the illustrations presented in this piece is to visualise how Wikipedia has very divergent geographic coverage in different languages. The tool also allows us to look at the date on which each of the 4.5 million geocoded articles in Wikipedia was created, enabling us to see how the focus of different linguistic communities has evolved.

Most geo-coded Wikipedia articles are located in the countries where the language is listed as an official one.

One of the most interesting patterns that we can see in the data is that over 70% of articles written in languages that are spoken predominantly in a single country (e.g. Czech or Italian) exist only in that language. This means, for instance, that there might be articles about thousands of Czech villages written in Czech, but not in English, French, German, or Japanese.
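The "only exist in that language" figure can be illustrated with a minimal sketch. It assumes, purely for illustration, that each geocoded article comes as an (item identifier, language code) pair and that articles about the same place share an identifier; the function name `exclusive_share` and the sample data are hypothetical and not part of the Terra Incognita tooling.

```python
from collections import defaultdict


def exclusive_share(articles, lang):
    """Return the fraction of places covered in `lang` that no other language covers.

    `articles` is an iterable of (item_id, language_code) pairs. The shared
    item_id linking articles about the same place is an assumption of this sketch.
    """
    langs_by_item = defaultdict(set)
    for item_id, article_lang in articles:
        langs_by_item[item_id].add(article_lang)

    # Keep only the places that have an article in the requested language.
    covered = [langs for langs in langs_by_item.values() if lang in langs]
    if not covered:
        return 0.0

    # A place is exclusive when the requested language is the only one covering it.
    exclusive = sum(1 for langs in covered if langs == {lang})
    return exclusive / len(covered)


# Invented toy data: three villages covered only in Czech, one city covered widely.
sample = [
    ("village_a", "cs"),
    ("village_b", "cs"),
    ("village_c", "cs"),
    ("prague", "cs"), ("prague", "en"), ("prague", "de"),
    ("london", "en"),
]

share_cs = exclusive_share(sample, "cs")
print(f"Share of Czech articles with no counterpart: {share_cs:.0%}")
```

In this toy data three of the four places covered in Czech have no article in another language, so the share is 75%.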

Furthermore, Terra Incognita studies how two or more languages intersect with each other, that is, where distinct Wikipedia editions refer to the same location, and what proportion of each edition's collection such articles make up. These linking points can be visualized by means of language intersection maps, which highlight locations referred to by more than one language.
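One way to picture such an intersection is sketched below. It assumes that two articles refer to the same location when their primary coordinates round to the same grid cell; this matching rule, the helper names, and the sample coordinates are assumptions made for illustration, not the method used by the project.

```python
def grid_cell(lat, lon, precision=2):
    """Snap a coordinate to a coarse grid cell (two decimals is roughly 1 km)."""
    return (round(lat, precision), round(lon, precision))


def intersection_stats(coords_a, coords_b, precision=2):
    """Return the shared cells and their share of each edition's distinct cells.

    coords_a and coords_b are lists of (lat, lon) primary coordinates.
    Matching by rounded coordinates is a simplification: two nearby points
    on opposite sides of a cell boundary will not be matched.
    """
    cells_a = {grid_cell(lat, lon, precision) for lat, lon in coords_a}
    cells_b = {grid_cell(lat, lon, precision) for lat, lon in coords_b}
    shared = cells_a & cells_b
    share_a = len(shared) / len(cells_a) if cells_a else 0.0
    share_b = len(shared) / len(cells_b) if cells_b else 0.0
    return shared, share_a, share_b


# Invented sample coordinates for two hypothetical editions.
czech = [(50.0755, 14.4378), (49.1951, 16.6068), (49.7384, 13.3736)]
german = [(50.0751, 14.4382), (52.5200, 13.4050)]

shared, share_cs, share_de = intersection_stats(czech, german)
print(f"Shared locations: {len(shared)}")
print(f"Share of the Czech set: {share_cs:.0%}, of the German set: {share_de:.0%}")
```

Plotting the shared cells on a map would give a simple version of a language intersection map; the two shares show how differently the same overlap weighs in each collection.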


The project was created by Gavin Baily and Sarah Bagshaw at TraceMedia, and was supported by funding from the Arts Council of England Grants for Arts and the National Lottery.

