A method for studying historical newspaper datasets

December 14, 2021

Tags: corpus analysis, Denmark, Finland, historical, historical newspapers, methods, nation, national, newspaper articles, semantics, Sweden, UK

The study “A data-driven approach to studying changing vocabularies in historical newspaper collections” by Simon Hengchen from the University of Gothenburg, Ruben Ros from the University of Luxembourg, and Jani Marjanen and Mikko Tolonen from the University of Helsinki presents a two-step method for studying the vocabulary of nation and semantically related words.

Although nation, nationhood and nationalism are widely studied concepts, earlier literature has paid little attention to the language of nationhood. This study sought to address this gap by quantitatively tracing the long-term development of the vocabulary of nationhood.

The dataset for the study consisted of newspapers from four countries in four languages. The Dutch dataset was the Delpher open newspaper archive for the period 1618-1876. The period 1877-1899 was available only through an API; the authors queried this data as well, although 100% recall is not guaranteed.

The Finnish dataset comprised both Finnish-language and Swedish-language newspaper articles. The authors used the entire Finnish and Swedish sub-corpora of the National Library of Finland's Newspaper and Periodical Corpus. For Sweden, the corpus was Kubhist 2, and the British corpora were British Library Newspapers, the 17th and 18th Century Nichols collection, and the 17th and 18th Century Burney collection.

The first part of the method used dependency parsing to extract all nouns modified by the adjective "national" or its equivalent in each of the languages. Adjectives modifying synonymous nouns were also included.
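To make the first step concrete, here is a minimal sketch of extracting nouns from a dependency parse. It assumes the text has already been parsed into Universal Dependencies-style tokens, where an adjective modifying a noun attaches to it with the `amod` relation. The `Token` structure, the `TARGET_ADJECTIVES` set and the hand-written example sentence are all hypothetical illustrations, not the authors' actual pipeline or parser.

```python
from collections import Counter
from typing import NamedTuple


class Token(NamedTuple):
    idx: int     # 1-based position in the sentence
    lemma: str   # dictionary form of the word
    upos: str    # Universal POS tag, e.g. "NOUN" or "ADJ"
    head: int    # idx of the syntactic head, 0 for the root
    deprel: str  # dependency relation to the head, e.g. "amod"


# Hypothetical target list; each language would need its own equivalents.
TARGET_ADJECTIVES = {"national"}


def nouns_modified_by(sentence, targets=TARGET_ADJECTIVES):
    """Return lemmas of nouns that a target adjective modifies via `amod`."""
    by_idx = {tok.idx: tok for tok in sentence}
    found = []
    for tok in sentence:
        if tok.upos == "ADJ" and tok.deprel == "amod" and tok.lemma in targets:
            head = by_idx.get(tok.head)
            if head is not None and head.upos == "NOUN":
                found.append(head.lemma)
    return found


# Hand-written parse of "The national debt worries the national assembly."
sentence = [
    Token(1, "the", "DET", 3, "det"),
    Token(2, "national", "ADJ", 3, "amod"),
    Token(3, "debt", "NOUN", 4, "nsubj"),
    Token(4, "worry", "VERB", 0, "root"),
    Token(5, "the", "DET", 7, "det"),
    Token(6, "national", "ADJ", 7, "amod"),
    Token(7, "assembly", "NOUN", 4, "obj"),
    Token(8, ".", "PUNCT", 4, "punct"),
]

# Tally how often each noun appears with a target adjective.
counts = Counter(nouns_modified_by(sentence))
print(counts)
```

Run over a whole corpus, a tally like `counts` would give the list of nouns that co-occur with the adjective, which is the input the second step needs.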

The second part of the method involved the training of diachronic word embeddings using gensim, a Python library for vector space modelling. Once the word embeddings were trained, the authors built a similarity matrix for all the nouns they had extracted.
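The similarity matrix can be illustrated with a small sketch. It assumes one set of word vectors per time slice, which is a common way to set up diachronic embeddings but not necessarily the authors' configuration. The period labels and two-dimensional toy vectors are invented stand-ins for vectors that would in practice come from the trained models, and cosine similarity is assumed as the measure.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_matrix(vectors):
    """Return (words, matrix) with matrix[i][j] = cosine(words[i], words[j])."""
    words = sorted(vectors)
    matrix = [[cosine(vectors[a], vectors[b]) for b in words] for a in words]
    return words, matrix


# Hypothetical toy vectors; real ones would be looked up in each period's model.
embeddings_by_period = {
    "period_1": {"debt": [1.0, 0.0], "assembly": [0.0, 1.0], "guard": [1.0, 1.0]},
    "period_2": {"debt": [1.0, 0.2], "assembly": [0.3, 1.0], "guard": [0.2, 1.0]},
}

# One matrix per period, so the same noun pair can be compared across time.
matrices = {
    period: similarity_matrix(vectors)
    for period, vectors in embeddings_by_period.items()
}

for period, (words, matrix) in matrices.items():
    print(period, words)
    for row in matrix:
        print([round(value, 3) for value in row])
```

Comparing the entry for one noun pair across the per-period matrices shows whether the two words moved closer together or further apart over time.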

The authors emphasize that, to their knowledge, a comparably vast undertaking has not been attempted before. The strength of the method presented in the paper is that it is not limited to the study of nation and nationhood but can be applied to a wide variety of other contexts.

The study “A data-driven approach to studying changing vocabularies in historical newspaper collections” by Simon Hengchen, Ruben Ros, Jani Marjanen, and Mikko Tolonen was published in Digital Scholarship in the Humanities (open access).

Picture: open book lot photo by Patrick Tomasso.

License Unsplash.
