Mapping news stories automatically

I can’t believe, looking at my post history, that the last one was a shameful seven months ago. In the meantime I’ve been writing a chapter, marking my first exams, and attending a two week program in network analysis at the Folger Library in Washington DC. It was pretty incredible, and I’ll probably elaborate a bit more on some of the stuff we covered once it’s sunk in a bit more. In the meantime, I wanted to share some fun research which was partially inspired by, or at least motivated by, the two weeks there.

The network analysis stuff I’ve been doing, based on actually sitting down and painstakingly reading through lots of news stories and recording the connections, is useful, but in an ideal world I’d like to be able to sit back and say that the six months or more I spent on that was actually sort of pointless. Not because the work that came out was not useful, but because I have developed a better way to automatically extract the same information. Reading everything in a world with so much information available isn’t sustainable. You can just about do it – Joseph Frank spent his life reading just about every single early newsbook – but now we’re all pretty much pressured into researching and publishing in a matter of months rather than years, and there’s always going to be a point where you are just overwhelmed by information. How about the rest of the century? Or the eighteenth, when so much more was published, and stories were recorded in a similar way? And what about international news? And what about manuscript newsletters?

If we want to really understand early modern news in its entirety, that is to say, if we want to see how it ebbed and flowed over time, and how various networks overlapped and changed depending on time and space and politics and current events, I think there has to be another, faster way to record these connections. I haven’t quite got there yet, but I wanted to share some work which at least can draw some simple geographical maps based on transcriptions of printed news. In this way, we might be able to at least begin to ‘map the production of space’, in the manner of Cameron Blevins’s work on 19th century Houston newspapers. This kind of automated data extraction, while basic, has the additional benefit of speed, and allows the possibility of drawing quick conclusions and comparisons. I love maps. I love how they are creative, flawed endeavours, based on the input of the creator but also the data that they use or feed to a computer. The post below will show some of the ways you can use similar data, or data extraction methodologies, to get very different results.

Extracting place-names wasn’t particularly difficult. The first thing you need are some transcriptions. Luckily, the reliable Lancaster newsbook corpus works perfectly here, and is easily the highest-quality newsbook transcription publicly available. It’s great for ‘proof of concept’ stuff, even though I don’t really think the content itself is particularly representative. It comes in a zip file, with individual transcriptions of issues of newsbooks from the mid 1650s, in XML format. The first thing I had to do was prepare the files for analysis. I needed to get them into one big text file, then strip out the XML code (or use it for something, which I’ll come to in a bit). The Python library ‘Beautiful Soup’ was made for this, and makes it easy to turn the files into basically a big blob of lowercase text, ready for analysis.

Making a big joined-up XML file was the easy part: all you need is the ‘cat’ command in the terminal on Mac OS/Linux. You can very quickly join the contents of a folder into one file using

cat * > merged-file

although of course then you end with all the extra XML statements, which I got rid of using a text editor and find/replace. Crude, but it worked.
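For what it’s worth, the tag-stripping step can be sketched in a few lines of Python. This is a minimal sketch rather than the script used for the maps in this post: it swaps in the standard library’s xml.etree.ElementTree for Beautiful Soup so that it runs without installing anything, and the function name and the sample XML are made up for illustration.

```python
import re
import xml.etree.ElementTree as ET


def xml_to_plain_text(xml_string):
    """Return the text content of an XML document as one lowercase string."""
    root = ET.fromstring(xml_string)
    # itertext() walks every element and yields only the text, not the tags.
    # Joining on a space keeps neighbouring elements apart, but it would also
    # split a word that has markup in the middle of it.
    text = " ".join(root.itertext())
    # Collapse runs of spaces and newlines left behind by the markup.
    return re.sub(r"\s+", " ", text).strip().lower()


# Hypothetical snippet standing in for one newsbook transcription.
sample = "<text><head>From Paris</head><p>Letters from <hi>Brussels</hi> say little.</p></text>"
print(xml_to_plain_text(sample))
```

Parsing the files one at a time like this also avoids the find/replace clean-up, since each file is still a well-formed XML document before it is merged.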

The XML transcriptions have got some pretty useful TEI markup, which is definitely worth trying to use. The text is marked up for presentation rather than semantics, that is to say, in general, it will tell you which parts of the text constitute a title, or in italics for emphasis, but not whether something is a location or a person, unfortunately. I started with a small experiment: extracting all the emphasised text, which gives most of the locations and people (proper nouns) in the text. This in itself was a useful starting point, but would require a good deal of data cleaning to pick out the location data. I started thinking about other ways to get meaningful data from these basic (but wonderfully accurate and complete!) transcriptions.
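The emphasised-text experiment might look something like the sketch below. It assumes the emphasised words sit inside an element called hi, which is a common TEI convention but an assumption about this particular corpus; the tag name is a parameter so it can be changed, and the sample sentence is invented.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def emphasised_terms(xml_string, tag="hi"):
    """Count the text found inside every <tag> element of the document."""
    root = ET.fromstring(xml_string)
    terms = []
    for element in root.iter(tag):
        # Gather the element's text and tidy any stray whitespace.
        term = " ".join("".join(element.itertext()).split())
        if term:
            terms.append(term)
    return Counter(terms)


# Invented example; "hi" is an assumed tag name, not checked against the corpus.
sample = "<p>From <hi>Paris</hi> and <hi>Brussels</hi>, and again <hi>Paris</hi>.</p>"
print(emphasised_terms(sample).most_common())
```

The counts are a quick way to see which proper nouns dominate, though people and places still have to be separated by hand.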

One of the good things about spending half a year of your life making a database of news locations is ending up with a huge list of common and obscure locations, and, if you retain original spellings, you can actually end up with a list of locations which includes lots of spelling variations. This is something really useful, and I wanted to try and see if I could check this list off against the transcribed newsbooks to give me a list of locations. I used the tutorial on The Programming Historian, and ended up with a list of places that also occurred in my own work (about 170 or so of the 1,200 or so individual, ‘dirty data’ place names). So this is a list of places that occur both in my data, from 1645 – 1649, but also occur in the newsbook transcriptions of 1655.
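Checking a list of known place names against the transcriptions can be sketched as below. This is not the Programming Historian tutorial’s code, just a minimal illustration of the idea: the gazetteer maps each spelling variant to a canonical name, and the handful of entries and the sample sentence are made up.

```python
import re


def places_in_text(text, gazetteer):
    """Count gazetteer matches in text, grouped under each canonical name.

    gazetteer maps a spelling variant to its canonical place name.
    """
    found = {}
    lowered = text.lower()
    for variant, canonical in gazetteer.items():
        # \b keeps a short variant from matching inside a longer word.
        pattern = r"\b" + re.escape(variant.lower()) + r"\b"
        hits = len(re.findall(pattern, lowered))
        if hits:
            found[canonical] = found.get(canonical, 0) + hits
    return found


# Illustrative entries only; the real list has around 1,200 'dirty' names.
gazetteer = {
    "Bruxels": "Brussels",
    "Brussels": "Brussels",
    "Dantzick": "Gdansk",
    "Venice": "Venice",
}
text = "Letters from Bruxels and Brussels confirm the news out of Dantzick."
print(places_in_text(text, gazetteer))
```

Keeping the original spellings as separate keys is what makes the variants useful: each one is matched as written and then folded into a single place.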

 

mytrainingner

This was a good start. There are some interesting patterns showing up already, for example, it’s very heavily based on coastal towns and cities, and mostly centred around the Low Countries, Germany and parts of Northern France and Italy. Perhaps this is one view of the world constructed by these 1650s newspapers?

But there’s a problem: this is based on a list I’ve taken from other newsbooks, and so by definition will only record overlap between the two, rather than giving a true picture of the geography of the sources. To get a better idea, I turned to Named Entity Recognition – NER. NER allows you to ‘train’ a script to pick out certain types of words. Basically, you give it a smallish list, tell it this is a location, and this, and this, and so forth, and then it can try and guess the same type of word from new text you feed it. For this I used the Stanford NER, which is a pretty easy-to-use program with a GUI, although you’ll need to delve into the command line to train your own classifier.
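Preparing training data for a classifier can be sketched roughly as follows. As far as I understand it, the Stanford NER trainer reads one token per line with a tab-separated label, and uses O for tokens that are not entities; treat that layout as an assumption to check against its documentation. The function name, the LOCATION label and the sample sentence are all made up, and this simple version only labels single-word place names.

```python
import re


def to_training_rows(text, known_locations, label="LOCATION"):
    """Turn text into 'token<TAB>label' rows, marking known place names."""
    locations = {name.lower() for name in known_locations}
    rows = []
    # Split into words and individual punctuation marks.
    for token in re.findall(r"\w+|[^\w\s]", text):
        tag = label if token.lower() in locations else "O"
        rows.append(f"{token}\t{tag}")
    return rows


# Invented sentence and place list, for illustration only.
rows = to_training_rows("From Paris, by way of Brussels.", ["Paris", "Brussels"])
print("\n".join(rows))
```

A list like the one from my own database is a cheap way to label a lot of text quickly, though the resulting classifier inherits whatever bias the list has.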

I tried mapping various versions of the data, both with the default classifier and my own version. Interestingly, the default was better at finding entities than my own version, although in a future post I’ll try a bunch of different trainers and compare them for accuracy and false positives. Some of the maps are below, mapped with R, after lots of data cleaning and pretty painstaking geocoding.

This is the data found with my classifier. It’s not as comprehensive as the default, showing a pretty similar bias towards Northern Europe/coastal towns as my own collected data.

mytrainingner

This is the default classifier: it’s definitely more comprehensive, but picked up quite a few false positives that required cleaning. The locations here are much more spread throughout France, but there are fewer in Germany/the Holy Roman Empire.

ner_data
NER data

 

This is the map generated from my own data, looking at the Moderate Intelligencer – a completely different source text. The spread of cities is reasonably similar, but there is much more of an emphasis on Italian news centres. Perhaps this reflects the move away from the Italian sources of news as relations with the Dutch and Spanish become more fraught but also more critically important?

mi_data
Moderate Intelligencer

 

And this is a map using the same method as above, with the source material being everything printed in London in 1649. Looking at the overall view, rather than a foreign-specialising title like the Moderate Intelligencer, it’s clear that the overall emphasis is on England, Scotland and Ireland, as well as France and the Low Countries, with a pretty even spread across most of the rest of Europe.

1649_data
1649 data

 

 

What to conclude from all this? Well, firstly, that different methods can produce very different maps, and it’s important to understand the methodology or the way in which the data was collected in the first place. But each of these tells us something about how space was produced – in different ways depending on time, place, and the creators of the news. Each reader will see a vision or version of Europe which is highly individual; mapping in this way allows us to see parts of the overall trend but also something of that individuality. Lastly, this is a step towards my eventual goal: pulling out network data using linguistic and topic analysis, rather than relying on manual input. Someday.


Bad maps

In the days after the U.S. election, the website breitbart.com published a map of results which showed the huge geographical disparity between Trump and Clinton. This (hilariously amateur-looking) map showed an America of great swathes of red, dotted with a few tiny squares of blue, pushing the narrative that Clinton’s popular vote was shored up by ‘liberal elites’, living only in big cities. The heartland, or real America, Breitbart contended with this map as evidence, voted overwhelmingly for Trump:

imrs.php-3.png

The Washington Post printed a version which, in their words, was “a map of actual American counties and not a red map that someone took into Microsoft Paint to dapple with little squares to have a fake map for a completely made-up story about the results of the election.”:

imrs.php-4.png

Political drama, played out over the tiny blue and red squares of U.S. counties. Both sides wielding maps as weapons – as objects with their own political agency rather than simply truth.

Maps have become the way through which we process election results, rather than bar charts or even numbers. In print media this has been the case for a long time (this one is from 1880), but now that most news is consumed on interactive, visually sophisticated devices, the importance and influence of the map has greatly increased. The election map helps to inform opinions and even influences how we think of the winners and losers. The political data website fivethirtyeight.com joked in a recent conversation that candidates would now start chasing electoral college votes to make a map that looked aesthetically pleasing, like having a band of blue states going right through the middle of the U.S., for example. Political commentators are starting to realise the importance of representation, and are putting forward alternative ways of mapping elections.

Maps are inherently political, and the way in which data is represented on them is problematic and filled with ideological bias. The example above is towards the extreme, but maps have always been about imposing political or cultural ideology on their audience, whether it be the colossal pink regions of the British Empire or something more benign: maps of human rights or developmental aid, for example, promote a particularly political message. And as most schoolkids learn, there are editorial decisions that have to be taken when producing a 2D representation of a 3D space. Maps are the lens through which we view the politics of our world and they can be distorted.

With a map, the audience has a fixed perception: because maps are supposed to be direct representations of our world, we expect them to contain, above all, an element of truth. This can be a problem. Whereas other visualisations, whether that be a bar chart or a network map, rely on our instinctive powers of visual deduction (a bigger bar means more!), a map brings with it a pre-learned set of markers wrapped in a profound cultural bias.

All this should inform both our study and our representation of the historical data. Early modern ideas of the map were less fixed, and often based on a particular use-case rather than an attempt at total representation (maps of postal routes, for example, were often drawn linearly, from bottom to top of a page, rather than placed geographically on a map). Early modern maps were a response to particular demands placed on them, not always created for general use. My contention is that news, knowledge of its flow and carriage, had as much of an impact on readers’ geographical worldview as visual representation did.

The pull to use maps in my own work is strong; they’re aesthetically pleasing, accessible, fun. They allow us to instantly make sense of sometimes geographically-confusing ideas. Placing data on top of geography can help with understanding: certain connected points may seem illogical until you realise they are all on coastal regions, or beside mountains, or whatever. Fernand Braudel, the great 20th century historian, made maps the centre-point of his ideas of geographical determinism.

But the map must be tempered; problematised. It must be used alongside other methods of visualisation. And the reader of a map must understand it as we understand historical sources: who made it, why, and for what audience? We have a responsibility to understand the provenance of the map – not only in farcical cases like that above, but even from reliable sources. We must realise that time and space, even on the printed page, are not fixed but malleable, open to distortion and subject to human agency.

 

The power of metadata

This is incredible – an explanation of the value of email metadata and how it can be used to track individuals. This particular example is based on metadata taken from a hacker organisation based in Italy, which was revealed last year during a hack on the organisation itself. Using just metadata, it’s possible to tell a huge amount about individuals and groups: where they live, who their contacts are, the key people in an organisation – even where some people spend their Christmas!

Metadata Investigation : Inside Hacking Team

My work is also based on the power of metadata, something I’ve written about before. Where and when information was sent is much more revealing than we think at first, especially when analysed quantitatively. It might be a great tool for historians, but perhaps its more contemporary uses are more morally difficult.

What’s wrong with a national history of the newspaper?

I’m just back from a conference in Dublin called Media Connections Between Britain and Ireland. There were some really interesting papers, and I think what I took away from it most was this idea of a shared history: that you can’t write a true history of ‘Irish’ media without referring to all the connections to other places. The term media ecology was mentioned several times, and although I’ve never heard it before it seemed so obvious and ubiquitous that by the end of the second day it had already entered our collective consciousness and felt cliched and buzzword-like. It struck me as a great way to describe how media, and indeed information, work. Information is well described as an ecology – a system which needs other elements to survive and thrive. In Ireland’s case the link to Britain is of course vital, but also links to other places throughout Europe. My own work finds Britain in a shared information landscape with Europe, and Ireland itself also part of this system; not reliant on its dominant neighbour for news though of course it is often also a conduit.

What arose as a point of discussion, through this idea of shared histories and connections, is the question of what it means to write a truly ‘national’ history of the press. This is something that the English, or British, historiographical scholarship has been dealing with for the past twenty or so years.  All histories are written with an eye to the present: the history of the newspaper is amongst the most obvious examples of this. The first histories of the newspaper sought to trace a line from the invention of the newspaper, through the invention of free speech, democracy and liberty. Jason McElligott, speaking at the conference, made the point that the first histories of the newspaper were in fact written by Americans: Joseph Frank’s The Beginnings of the English Newspaper writes as if the invention was an English, national phenomenon, something which is so obviously untrue to anyone who spends about five minutes reading the narrative of the first papers.

These first authors were implying a link: the inventors of the newspaper were also responsible for the spread of Enlightenment ideals – the rule of law, free speech, democracy, all through the freedom of the press and the creation of a public sphere where ideas could be debated freely. So why American historians, if this was perceived as an English phenomenon? Beginnings was  published in 1961, when the United States believed it had inherited the role of defenders of the ideas of free speech, democracy and so forth from the now-crumbling British Empire. To put it simply – the Americans needed a national, and specifically British version of history, in order to promote their own perceived position as the new defenders of Enlightenment ideals. A free press is essential to a free society, as the English model has shown us. Now it’s our turn to safeguard a free press – and while we’re at it, defend the world from those who abhor freedom (the communists). I’m not suggesting Joseph Frank had this specifically in mind when writing Beginnings, only that that’s the political and therefore historiographical landscape within which he was working at the time.

So that’s one reason we end up with a national, rather than a trans-national historiographical tradition. The other reason is that if we concentrate on the content of the newsbooks, and the readers of them, we naturally tend to work through one language, or perhaps two or three at the most with a talented, polyglot researcher. This method of writing a history is inherently national, promoting and insinuating essentially false linguistic divisions and political boundaries. We know now that news, if we see it as a spread of information, is actually deeply non-national and happily betrays political borders.

This is why we need new methodologies – old ways produce old histories, and will always be stuck to some degree in the national. When we can use digital tools – whether that be linguistic analysis, statistics, or in my case network science, we can free ourselves from the ‘tyranny’ of a national, publisher-author-relationship history, and begin to articulate what it is precisely we mean by a system of news that is transnational in scope. We can visualise it, explore it, describe it; not just declare its existence and leave it at that.

It’s not a perfect way of doing things: by concentrating on this ‘flow of information’, there is the risk of ignoring the reader – who, as it was pointed out to me at the conference, was decidedly national. Newspapers were read, after all, by readers largely of the same country, even if the information they were based on was travelling on a single, international system. An ideal history would bring both of these traditions together: not ignoring the content and the reader, but allowing for the fact that whether they knew it or not, they were part of this complexity. Only then will we have a true history of the newspaper.

Travel time and distance chart

This is a chart based on a pretty small subset (67 locations) of my data. I’ve plotted the average time for a story from the place to arrive back in London, against the distance (using modern roads, unfortunately).

The correlation is very strong, obviously enough. Still, I think it’s a nice representation of the data. There’s quite a lot going on here. The size of the bubbles corresponds to the city’s population, and they’re coloured by the network community they have been placed in by the software.

I suppose the most interesting finding here is just how closely linked time and distance are, in Western Europe at least. Information seems to travel at a remarkably constant rate throughout this part of the continent, at least as it makes its way back to London. It shows, perhaps, how reliable the postal system was by this point in the 1640s. It would be interesting to add data points from the further reaches of Europe and outside: I would imagine that the correlation would drop off extremely quickly.

It’s also a way of visualising which parts of Europe take equal amounts of time to send news back to London. Madrid and Milan, for example, take 19 and 20 days respectively, even though Milan is nearly 500 kilometres closer. So even though information is getting back from Madrid with more speed, this doesn’t equate to a better connection or more lines of news.
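For anyone who wants to check this kind of relationship on their own data, here is a minimal sketch of the two calculations involved: a Pearson correlation between distance and travel time, and an average speed in kilometres per day. The place names and numbers below are placeholders, not my data, and the function names are made up.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def km_per_day(distance_km, days):
    """Average speed of a story on its way to London."""
    return distance_km / days


# Placeholder rows: (place, distance in km, average days to reach London).
rows = [("Place A", 300, 4), ("Place B", 900, 10), ("Place C", 1500, 19), ("Place D", 1000, 20)]
distances = [d for _, d, _ in rows]
days = [t for _, _, t in rows]

print("correlation:", round(pearson(distances, days), 2))
for name, distance, time_taken in rows:
    print(name, round(km_per_day(distance, time_taken), 1), "km/day")
```

Two places with nearly the same travel time but very different distances, like Madrid and Milan above, show up as very different speeds in the second calculation.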

Anyway, have a hover through the data…

 

 

Metadata: an accidental lovenote to the future

I haven’t posted in over a month now. Most of January was spent hiding from bitter winds in the QMUL library and preparing for a funding application, which was ultimately unsuccessful and to be honest knocked the wind out of my sails quite a bit. Still, the research must go on, even if it must continue at a pace slower than I’d like.

I’ve mostly been preparing and planning a first draft of another chapter, so rather than having any new discoveries or insights to share on here, I’ve been buried in the data I’ve already been working on, and lots of secondary source material (I’m becoming a bit obsessed with Filippo de Vivo’s Information and Communication in Venice, and Fernand Braudel’s The Mediterranean).  I did spend quite a bit of time trying to come up with a method for guessing what dating system a particular town was using, which was good fun but ultimately a complete failure, which will be the subject of my next post…

In the meantime, here’s the text of an informal paper I presented at an evening in the University called ‘Cafe Scientifique’. It was supposed to be open to students from all disciplines but I think I was the only one who presented not from a STEM field – possibly the only one in the room, which was embarrassing and intimidating in equal measure. I’m mostly putting this paper up to get virtual ink back on virtual paper in the hope of getting some momentum going again for this blog, but it does explain some of the reasons why I think my research is worthwhile.

 

Metadata: an Accidental Love Note to the Future

I’ve called this talk ‘Metadata is a love note to the future’, from a quote I found online and ironically can’t seem to find any details of where it originated. It struck me as a great way to describe the purpose of metadata. I interpret the quote as a reference to the deliberate addition of information – comments, tags and so forth – to various types of content to help new users when the original creator is gone, whether dead, not known, or for whatever other reason. The most common ones I can think of are adding comments to code to help the future unknown or unknowable programmers fix bugs or add new features, or maybe embedding tags in photos or video so that future viewers can know not just what’s in front of them, but additional data such as where a photo was taken, or the names of people starring in a video. Metadata is a way of not only adding usefulness in the present, but preserving it for the future to understand and make sense of.

Today most of our metadata is deliberate, even if its usefulness is not always known in advance. We add to content in the hope that one day it will help somebody make sense of something not apparent directly from the content that might otherwise have been lost, and it’s often directly for the benefit of computers, making photos searchable by Google, for example. However, today I want to show you a case of metadata that can be said to be largely accidental, but has turned out to be a very significant love-note to the future with wide-ranging implications as to how we view a crucial aspect of Europe in the seventeenth century.

The earliest recognisable newspapers in England, then called newsbooks, were published in the 1620s, and although only allowed to publish foreign news at first, they could print both domestic and foreign news from 1641. The civil war brought an explosion of titles, many just lasting a few months. This is a page from a newsbook called the Moderate Intelligencer, published from 1645 – 1649. I’ll give a quick description of the form, which is typical of most of the newsbooks around this time. It’s a series of paragraphs of news, mostly political and military stories, from all over Europe. Crucially, a convention had been established which meant that, in addition to the location of the story, information about where the news had been sent from, and any further places it had been relayed through, was included in the paragraph. So a story about a town in Germany might contain metadata saying it was sent to London from Brussels, and had originally been sent there from Augsburg.
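As a rough illustration of how that relay information can be recorded as data, here is a minimal sketch that turns each story’s chain of places into ‘sent from, sent to’ pairs and counts how often each pair occurs. The Augsburg to Brussels to London chain is the example above; the other chain, and the function names, are made up.

```python
from collections import Counter


def chain_to_edges(chain):
    """Turn an ordered relay chain into (sent_from, sent_to) pairs."""
    # Pair each place with the next one along; skip a place relaying to itself.
    return [(a, b) for a, b in zip(chain, chain[1:]) if a != b]


def count_edges(chains):
    """Count how many stories travelled along each connection."""
    weights = Counter()
    for chain in chains:
        weights.update(chain_to_edges(chain))
    return weights


# Each list is one story's route, origin first, London last.
chains = [
    ["Augsburg", "Brussels", "London"],
    ["Augsburg", "Brussels", "London"],
    ["Venice", "London"],  # invented route, for illustration
]
for (sent_from, sent_to), stories in count_edges(chains).items():
    print(sent_from, "->", sent_to, stories)
```

The resulting table of pairs and counts is the kind of input most network software expects.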

What makes the metadata an ‘accidental love note’ is how we can use it now for purposes not imagined at the time.  Luckily for us a series of connections, when we have enough of them, is exactly the information needed to make something called a network graph: a visualisation of a system of entities, called nodes, and connections between them, called edges.  [show slide two, mini-network] Network software can be used to take these nodes and edges and make a visualisation based on the importance of various nodes.  If a node has many connections, it’s more prominent and central on the graph.  We end up with a visual interpretation of the connectivity of Europe, based on its strengths and weaknesses, which looks like this:

network3

We are by our nature visual creatures: we can use a map like this to pick out patterns and draw conclusions that would be very difficult to reach just by looking at rows and columns of text and numbers. For example, we can say that the overall macro-structure of this network is a series of closely-linked clusters: there’s a cluster based around the Mediterranean and Iberian peninsula, incorporating Lisbon, Madrid, Marseilles, Genoa and Naples, another cluster based around Venice which serves as a massive hub with many small links primarily in the East, a cluster of the Low Countries, Paris and Barcelona, a cluster of German cities, one based around Prague and Vienna in the East, Hamburg, Gdansk and Scandinavia in the North and finally the British Isles. It’s hard to define a single pattern for the connections: some seem to be based on geographical properties such as the Mediterranean, others on political or confessional lines. Many of the placements of cities in clusters are intuitive or obvious, others less so – Paris, for example, is much more connected to the Low Countries than to the rest of France, corroborating other studies that show Paris as part of the North-Western European urban structure rather than a central hub in a primarily French city system.

The network is also highly concentrated: this last slide is a plot of the density of the network. It essentially shows that most of the links are concentrated in a very small number of nodes, while a large number of nodes have only one or two links. This pattern shows up in a huge number of real-world networks, and part of my next area of research is trying to explain why it would be the case here even though it’s based around a European urban system with a population pattern that is much less concentrated.
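The concentration described here can be checked with a simple count of how many links each node has. The sketch below is a minimal version using made-up edges and hypothetical function names; it is not the calculation behind the slide.

```python
from collections import Counter


def node_degrees(edges):
    """Number of links attached to each node."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return degree


def degree_histogram(degree):
    """How many nodes have one link, two links, and so on."""
    return Counter(degree.values())


def share_held_by_top(degree, top_n):
    """Fraction of all link-ends attached to the top_n best-connected nodes."""
    total = sum(degree.values())
    top = sum(count for _, count in degree.most_common(top_n))
    return top / total


# Invented toy network: one hub and a few poorly connected places.
edges = [("Hub", "A"), ("Hub", "B"), ("Hub", "C"), ("Hub", "D"), ("A", "B")]
degree = node_degrees(edges)
print(degree_histogram(degree))
print(share_held_by_top(degree, 1))
```

In a network like the one described, the histogram is dominated by nodes with one or two links, and a handful of hubs account for a large share of the total.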

The network paints a picture of a Europe that was a place of flow rather than stasis. Early modern narratives, like our own, depended on the idea of movement, and for some at least were about transitions and flux. We can sometimes think of life at this time as very insular, lots of people interested only in family or local matters, but actually life wasn’t static and even with primitive communication could be surprisingly dynamic. The network illustrates this movement – the movement of people and their ideas, through the news stories they read and information they learn, through borders and political lines but also movement of the news stories themselves, which were like paper bullets, flying through the postal system, picking up lies and discarding truths or vice versa, taking on a life of their own, existing in their own right outside the content itself. It adds to the picture of Europe not being simply a group of individual inward-looking nations, but a place which already had a sense of a collective intellectual consciousness. The study of the network helps us to quantify how this consciousness spread and allows us to discover the cities which were most important in allowing that to happen.

In a more general sense this story also shows the importance of reading between and outside the lines – not only looking at the content but also the data around the content, which can sometimes tell us as much or even more than the stories themselves. Who wrote something, from where it was written, and when, particularly when analysed using a quantitative approach, can at the very least point us in new directions in our research, and sometimes even illuminate new ways of looking at the past. I hope that work like this is going to point to a new direction for humanities research, where we combine ‘close readings’ of texts with new techniques like network analysis which help us to ask new questions.

Isochronic map finally! (sort of)

A couple of changes to the map I posted in my last post, thanks to the awesome AlternativeTransport blog.

mapforiso2.png

 

They suggested having fewer categories and cropping out areas with no data, and it definitely looks better!

I also had a go at manually creating an isochronic map. In many ways this is actually less representative than the above, but more visually appealing. The data I have is only from point-to-point, so suggesting that everything within these areas has a similar travel time is a bit inaccurate.

isochrone2.png

 

Here’s probably as close as I’m going to get to a proper isochronic map for the moment. It was made by roughly joining the dots in the first map. It’s still a bit rough, but it communicates something anyway!