I can’t believe, looking at my post history, that the last one was a shameful seven months ago. In the meantime I’ve been writing a chapter, marking my first exams, and attending a two-week program in network analysis at the Folger Library in Washington DC. It was pretty incredible, and I’ll probably elaborate on some of the stuff we covered once it’s sunk in a bit more. For now, I wanted to share some fun research which was partially inspired by, or at least motivated by, the two weeks there.
The network analysis stuff I’ve been doing, based on actually sitting down and painstakingly reading through lots of news stories and recording the connections, is useful, but in an ideal world I’d like to be able to sit back and say that the six months or more I spent on that was actually sort of pointless. Not because the work that came out was not useful, but because I have developed a better way to automatically extract the same information. Reading everything in a world with so much information available isn’t sustainable. You can just about do it – Joseph Frank spent his life reading just about every single early newsbook – but now we’re all pretty much pressured into researching and publishing in a matter of months rather than years, and there’s always going to be a point where you are just overwhelmed by information. How about the rest of the century? Or the eighteenth, when so much more was published, and stories were recorded in a similar way? And what about international news? And what about manuscript newsletters?
If we want to really understand early modern news in its entirety, that is to say, if we want to see how it ebbed and flowed over time, and how various networks overlapped and changed depending on time and space and politics and current events, I think there has to be another, faster way to record these connections. I haven’t quite got there yet, but I wanted to share some work which at least can draw some simple geographical maps based on transcriptions of printed news. In this way, we might be able to at least begin to ‘map the production of space’, in the manner of Cameron Blevins’s work on 19th-century Houston newspapers. This kind of automated data extraction, while basic, has the additional benefit of speed, and allows the possibility of drawing quick conclusions and comparisons. I love maps. I love how they are creative, flawed endeavours, based on the input of the creator but also the data that they use or feed to a computer. The post below will show some of the ways you can use similar data, or data extraction methodologies, to get very different results.
Extracting place-names wasn’t particularly difficult. The first thing you need are some transcriptions. Luckily, the reliable Lancaster newsbook corpus works perfectly here, and is easily the highest-quality newsbook transcription publicly available. It’s great for ‘proof of concept’ stuff, even though I don’t really think the content itself is particularly representative. It comes in a zip file, with individual transcriptions of issues of newsbooks from the mid 1650s, in XML format. The first thing I had to do was prepare the files for analysis. I needed to get them into one big text file, then strip out the XML code (or use it for something, which I’ll come to in a bit). The Python library ‘Beautiful Soup’ was made for this, and makes it easy to turn the files into basically a big blob of lowercase text, ready for analysis.
Making a big joined-up XML file was the easy part: all you need is the ‘cat’ command in the terminal on Mac OS/Linux. You can very quickly join the contents of a folder into one file using
cat * > merged-file
although of course then you end up with all the extra XML statements, which I got rid of using a text editor and find/replace. Crude, but it worked.
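For anyone who would rather skip the find/replace step, here is a minimal sketch of doing the merge and the markup-stripping in one pass. It is not the method used above: it swaps in Python's standard-library `xml.etree.ElementTree` for Beautiful Soup, parses each file on its own (so the per-file XML headers never end up in the middle of the merged text), and assumes every file is well-formed XML with a `.xml` extension. The function names and the two tiny demo files are made up for illustration; real corpus files may need extra handling for entities or DOCTYPE declarations.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path


def xml_to_text(path):
    """Return the lowercased text content of one XML file, markup removed."""
    root = ET.parse(str(path)).getroot()
    # itertext() yields every text node in document order, skipping the tags.
    # Joining with a space keeps words in neighbouring elements apart, at the
    # cost of the odd stray space before punctuation.
    text = " ".join(root.itertext())
    # Collapse the runs of whitespace left behind by the markup.
    return " ".join(text.split()).lower()


def merge_folder(folder):
    """Join the text of every .xml file in a folder, in filename order."""
    paths = sorted(Path(folder).glob("*.xml"))
    return "\n".join(xml_to_text(p) for p in paths)


# Demonstration with two invented files standing in for the corpus.
with tempfile.TemporaryDirectory() as demo_dir:
    Path(demo_dir, "issue1.xml").write_text(
        "<text><head>From <em>Paris</em></head><p>Letters say little.</p></text>",
        encoding="utf-8",
    )
    Path(demo_dir, "issue2.xml").write_text(
        "<text><p>From <em>Hamburg</em>, the 3 of May.</p></text>",
        encoding="utf-8",
    )
    merged = merge_folder(demo_dir)

print(merged)
```

Parsing file by file also keeps the door open for recording which issue each place-name came from, which a single merged blob throws away.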
The XML transcriptions have got some pretty useful TEI markup, which is definitely worth trying to use. The text is marked up for presentation rather than semantics, that is to say, in general, it will tell you which parts of the text constitute a title, or are in italics for emphasis, but unfortunately not whether something is a location or a person. I started with a small experiment: extracting all the emphasised text, which gives most of the locations and people (proper nouns) in the text. This in itself was a useful starting point, but would require a good deal of data cleaning to pick out the location data. I started thinking about other ways to get meaningful data from these basic (but wonderfully accurate and complete!) transcriptions.
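The emphasised-text experiment might look something like the sketch below. The tag name `em` is an assumption made for illustration (check the corpus files for the element that actually marks italics), the sample XML is invented, and the standard-library parser stands in for whatever tool you prefer.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Invented sample; the real files are much longer and more richly marked up.
sample = """<text>
<p>From <em>Paris</em>, letters say the <em>Cardinal</em> is gone to <em>Rouen</em>.</p>
<p>The fleet is still before <em>Dunkirk</em>; from <em>Paris</em> nothing more.</p>
</text>"""


def emphasised_phrases(xml_string, tag="em"):
    """Return the text of every element with the given tag, in document order."""
    root = ET.fromstring(xml_string)
    phrases = []
    for element in root.iter(tag):
        # itertext() also picks up text inside any nested elements.
        phrase = " ".join("".join(element.itertext()).split())
        if phrase:
            phrases.append(phrase)
    return phrases


counts = Counter(emphasised_phrases(sample))
for phrase, n in counts.most_common():
    print(n, phrase)
```

Even in this toy sample a person ("Cardinal") turns up alongside the places, which is exactly the cleaning problem described above.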
One of the good things about spending half a year of your life making a database of news locations is ending up with a huge list of common and obscure locations, and, if you retain original spellings, you can actually end up with a list of locations which includes lots of spelling variations. This is something really useful, and I wanted to try and see if I could check this list off against the transcribed newsbooks to give me a list of locations. I used the tutorial on The Programming Historian, and ended up with a list of places that also occurred in my own work (about 170 or so of the 1,200 or so individual, ‘dirty data’ place names). So this is a list of places that occur both in my data, from 1645 – 1649, but also occur in the newsbook transcriptions of 1655.
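As a rough sketch of that checking-off step (not the Programming Historian code itself), the idea is to keep a lookup from each original spelling to a modern name and search the lowercased text for any of the spellings. The handful of entries and the sample sentence are invented for illustration; a real list would have over a thousand variants.

```python
import re
from collections import Counter

# Hypothetical slice of a hand-built place list: original spelling -> modern name.
variants = {
    "hamburgh": "Hamburg",
    "hamburg": "Hamburg",
    "dantzick": "Gdansk",
    "the hague": "The Hague",
    "hague": "The Hague",
    "paris": "Paris",
}


def find_places(text, variants):
    """Count how often each known place (under any listed spelling) occurs."""
    # Try the longest spellings first, so "the hague" wins over "hague".
    spellings = sorted(variants, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(s) for s in spellings) + r")\b"
    )
    return Counter(variants[m.group(1)] for m in pattern.finditer(text.lower()))


text = (
    "from hamburgh, may 3. letters from the hague and from hamburg "
    "say that paris is quiet."
)
found = find_places(text, variants)
print(found)
```

Folding the variants into one modern name at this stage saves a good deal of cleaning before geocoding, although place-names that double as ordinary words will still produce false matches.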
This was a good start. There are some interesting patterns showing up already: for example, it’s very heavily based on coastal towns and cities, and mostly centred around the Low Countries, Germany and parts of Northern France and Italy. Perhaps this is one view of the world constructed by these 1650s newspapers?
But there’s a problem: this is based on a list I’ve taken from other newsbooks, and so by definition will only record overlap between the two, rather than giving a true picture of the geography of the sources. To get a better idea, I turned to Named Entity Recognition – NER. NER allows you to ‘train’ a script to pick out certain types of words. Basically, you give it a smallish list, tell it this is a location, and this, and this, and so forth, and then it can try and guess the same type of word from new text you feed it. For this I used the Stanford NER, which is a pretty easy-to-use program with a GUI, although you’ll need to delve into the command line to train your own classifier.
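To give a flavour of what ‘training’ involves: Stanford NER's classifier is normally trained from a plain text file with one token per line, followed by a tab and a label, with `O` used for anything that is not an entity. The sketch below is one possible way to bootstrap such a file from a list of known places, and is an assumption about workflow rather than a record of how the classifier in this post was trained. The seed list and sentence are invented, it only handles single-word names, and the label name should match whatever your properties file expects.

```python
import re

# Hypothetical seed list of single-word place names, all lowercase.
known_places = {"paris", "hamburg", "dunkirk"}


def to_training_lines(text, places):
    """Label each token LOCATION or O, one 'token<TAB>label' pair per line."""
    # Split into words and individual punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    lines = []
    for token in tokens:
        label = "LOCATION" if token.lower() in places else "O"
        lines.append(f"{token}\t{label}")
    return lines


sample = "From Paris, letters say the fleet lies before Dunkirk."
training = to_training_lines(sample, known_places)
print("\n".join(training))
```

Labels produced this way still need checking by hand, since a list lookup cannot tell a place from a person or ship of the same name.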
I tried mapping various versions of the data, both with the default classifier and my own version. Interestingly, the default was better at finding entities than my own version, although in a future post I’ll try a bunch of different trainers and compare them for accuracy and false positives. Some of the maps are below, mapped with R, after lots of data cleaning and pretty painstaking geocoding.
This is the data found with my classifier. It’s not as comprehensive as the default, showing a pretty similar bias towards Northern Europe/coastal towns as my own collected data.
This is the default classifier: it’s definitely more comprehensive, but picked up quite a few false positives that required cleaning. The locations here are much more spread throughout France, but less so in Germany/the Holy Roman Empire.
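The trade-off between ‘more comprehensive’ and ‘more false positives’ can be put into numbers once you have a hand-checked list of places for a sample of text. Here is a minimal sketch of that comparison using precision (how much of what was found is really a place) and recall (how many of the real places were found). All of the names and outputs below are invented for illustration and are not results from the classifiers in this post.

```python
def compare(predicted, gold):
    """Score a set of extracted names against a hand-checked set of places."""
    predicted, gold = set(predicted), set(gold)
    true_positives = predicted & gold
    return {
        "precision": len(true_positives) / len(predicted) if predicted else 0.0,
        "recall": len(true_positives) / len(gold) if gold else 0.0,
        "false_positives": sorted(predicted - gold),
        "missed": sorted(gold - predicted),
    }


# Invented example data: a hand-checked list and two imaginary classifier outputs.
gold = {"Paris", "Hamburg", "Dunkirk", "Rouen"}
default_output = {"Paris", "Hamburg", "Dunkirk", "Rouen", "Cardinal", "Parliament"}
custom_output = {"Paris", "Hamburg"}

default_scores = compare(default_output, gold)
custom_scores = compare(custom_output, gold)
print("default:", default_scores)
print("custom: ", custom_scores)
```

In this made-up case the broader classifier finds everything but needs cleaning, while the narrower one is clean but misses half the places, which mirrors the pattern described for the two maps.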
This is the map generated from my own data, looking at the Moderate Intelligencer – a completely different source text. The spread of cities is reasonably similar, but there is much more of an emphasis on Italian news centres. Perhaps the difference reflects a move away from Italian sources of news by the 1650s, as relations with the Dutch and Spanish became more fraught but also more critically important?
And this is a map using the same method as above, with the source material being everything printed in London in 1649. Looking at the overall view, rather than a foreign-specialising title like the Moderate Intelligencer, it’s clear that the overall emphasis is on England, Scotland and Ireland, as well as France and the Low Countries, with a pretty even spread across most of the rest of Europe.
What to conclude from all this? Well, firstly, that different methods can produce very different maps, and it’s important to understand the methodology or the way in which the data was collected in the first place. But each of these tells us something about how space was produced – in different ways depending on time, place, and the creators of the news. Each reader will see a vision or version of Europe which is highly individual; mapping in this way allows us to see parts of the overall trend but also something of that individuality. Lastly, this is a step towards my eventual goal: pulling out network data using linguistic and topic analysis, rather than relying on manual input. Someday.