A method for studying historical newspaper datasets

The study “A data-driven approach to studying changing vocabularies in historical newspaper collections” by Simon Hengchen from the University of Gothenburg, Ruben Ros from the University of Luxembourg, and Jani Marjanen and Mikko Tolonen from the University of Helsinki created a two-step method for studying the vocabulary of nation and semantically related words.

Although nation, nationhood and nationalism are widely studied concepts, earlier literature on the language of nationhood has been scarce. This study sought to address this gap by quantitatively tracing the long-term development of the vocabulary of nationhood.

The dataset for the study consisted of newspapers from four countries in four languages. The Dutch dataset was the Delpher open newspaper archive for the period 1618-1876. Data for the period 1877-1899 was available only through an API; the authors queried it as well, although 100% recall is not guaranteed.

The Finnish dataset comprised both Finnish- and Swedish-language newspaper articles. The authors used the entire Finnish and Swedish sub-corpora of the National Library of Finland's Newspaper and Periodical Corpus. For Sweden, the authors used the Kubhist 2 corpus, and the British corpora were British Library Newspapers, the 17th and 18th Century Nichols collection, and the 17th and 18th Century Burney collection.

The first part of the method used dependency parsing to extract all nouns modified by an adjective meaning "national" in each of the languages. Adjectives modifying synonymous nouns were also included.
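The extraction step can be illustrated with a minimal sketch. It assumes the text has already been parsed and that the parser emits Universal Dependencies-style labels, where an adjective modifying a noun carries the relation `amod`. The `Token` structure, the function name `nouns_modified_by` and the example sentence are hypothetical and chosen only for illustration; the paper's actual parser and pipeline may differ.

```python
from collections import Counter
from typing import NamedTuple


class Token(NamedTuple):
    """One parsed token (hypothetical structure, loosely modelled on CoNLL-U)."""
    idx: int      # 1-based position in the sentence
    lemma: str    # lemmatised word form
    upos: str     # universal part-of-speech tag, e.g. "NOUN" or "ADJ"
    head: int     # idx of the syntactic head (0 means root)
    deprel: str   # dependency relation to the head, e.g. "amod"


def nouns_modified_by(sentences, adjective_lemmas):
    """Count nouns that are directly modified by one of the given adjectives."""
    counts = Counter()
    for sentence in sentences:
        by_idx = {tok.idx: tok for tok in sentence}
        for tok in sentence:
            is_target = (
                tok.upos == "ADJ"
                and tok.deprel == "amod"
                and tok.lemma in adjective_lemmas
            )
            if not is_target:
                continue
            head = by_idx.get(tok.head)
            if head is not None and head.upos == "NOUN":
                counts[head.lemma] += 1
    return counts


# Hand-annotated parse of "The national debt worries the national assembly."
sentence = [
    Token(1, "the", "DET", 3, "det"),
    Token(2, "national", "ADJ", 3, "amod"),
    Token(3, "debt", "NOUN", 4, "nsubj"),
    Token(4, "worry", "VERB", 0, "root"),
    Token(5, "the", "DET", 7, "det"),
    Token(6, "national", "ADJ", 7, "amod"),
    Token(7, "assembly", "NOUN", 4, "obj"),
]

counts = nouns_modified_by([sentence], {"national"})
print(counts)
```

For other languages, the set passed as `adjective_lemmas` would hold the corresponding adjective in that language.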

The second part of the method involved the training of diachronic word embeddings using gensim, a Python library for vector space modelling. Once the word embeddings were trained, the authors built a similarity matrix for all the nouns they had extracted. 
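The similarity matrix can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the vectors are hand-made toy values, and cosine similarity is assumed as the measure. With a trained gensim model, the vector for a word would typically be read from `model.wv[word]`, and for a diachronic analysis one such matrix could be built per time period.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_matrix(vectors):
    """Return the sorted word list and the pairwise cosine similarity matrix."""
    words = sorted(vectors)
    matrix = [[cosine(vectors[a], vectors[b]) for b in words] for a in words]
    return words, matrix


# Toy vectors standing in for trained embeddings (hypothetical values).
toy_vectors = {
    "debt": [1.0, 0.0, 0.0],
    "assembly": [0.0, 1.0, 0.0],
    "parliament": [0.0, 2.0, 0.0],
}

words, matrix = similarity_matrix(toy_vectors)
for word, row in zip(words, matrix):
    print(word, [round(value, 2) for value in row])
```

Rows and columns follow the alphabetical order of `words`, so `matrix[i][j]` is the similarity between `words[i]` and `words[j]`.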

The authors emphasize that, to their knowledge, a comparably vast undertaking has not been attempted before. A strength of the method presented in the paper is that it is not limited to the study of nation and nationhood but can be applied to a wide variety of contexts.

The study “A data-driven approach to studying changing vocabularies in historical newspaper collections” by Simon Hengchen, Ruben Ros, Jani Marjanen, and Mikko Tolonen was published in Digital Scholarship in the Humanities (open access).

Picture: open book lot photo by Patrick Tomasso.

License: Unsplash
