I recently came across two really cool data sources: 1) The Guardian Open Platform and 2) the Trove API from the National Library of Australia. So I decided to collect every news story they had on “mafias”. I managed to build a corpus of about 2,500 documents by combining news articles from these two databases. Not too bad! I wanted to see how discourses fluctuate over time, so I trained a topic model, a program that assumes each newspaper article is some combination of 20 topics/discourses, and created the following visualisation to explore the data.
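For the curious, here is a minimal sketch of how this kind of topic model can be fitted. It is not the exact code behind this post: it assumes an LDA-style model trained by collapsed Gibbs sampling in plain Python, and the function name `fit_topics`, the parameters `alpha` and `beta`, and the toy documents are all made up for illustration.

```python
import random
from collections import Counter


def fit_topics(docs, n_topics=20, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Fit an LDA-style topic model with collapsed Gibbs sampling.

    docs is a list of token lists. Returns (doc_topic, topic_word):
    doc_topic[d][k] is the estimated share of document d in topic k,
    topic_word[k] is a Counter of word counts assigned to topic k.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic_counts = [[0] * n_topics for _ in docs]
    topic_word = [Counter() for _ in range(n_topics)]
    topic_totals = [0] * n_topics
    assignments = []

    # Start by giving every token a random topic.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(n_topics)
            zs.append(k)
            doc_topic_counts[d][k] += 1
            topic_word[k][w] += 1
            topic_totals[k] += 1
        assignments.append(zs)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Take the token out of its current topic.
                k = assignments[d][i]
                doc_topic_counts[d][k] -= 1
                topic_word[k][w] -= 1
                topic_totals[k] -= 1
                # Weight each topic by how much this document uses it
                # and how much the topic uses this word.
                weights = [
                    (doc_topic_counts[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_totals[t] + beta * vocab_size)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic_counts[d][k] += 1
                topic_word[k][w] += 1
                topic_totals[k] += 1

    # Smoothed topic shares per document; each row sums to 1.
    doc_topic = [
        [(c + alpha) / (len(doc) + alpha * n_topics) for c in counts]
        for doc, counts in zip(docs, doc_topic_counts)
    ]
    return doc_topic, topic_word


# Toy documents (invented); the real corpus has about 2,500 articles.
toy_docs = [
    "mafia boss arrested police raid".split(),
    "police raid mafia arrested trial".split(),
    "film review mafia movie director".split(),
    "movie director film award review".split(),
]
doc_topic, topic_word = fit_topics(toy_docs, n_topics=2, n_iter=50)
```

Each row of `doc_topic` is the "combination of topics" for one article, and `topic_word` holds the word counts that describe each topic. On a real corpus you would more likely reach for an existing library than a hand-rolled sampler like this one.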
All 20 topics are projected onto the map as circles; the area of each circle and the number on it encode the prevalence of that topic. You can click on a circle/topic to see its most relevant words on the right, with the red bar indicating the number of times a term appears in that topic. The blue bar is the overall frequency of that term in the entire corpus. The map can help us decode the semantic distance (or proximity) between the topics. Look how nicely certain topics are clustered together!
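To make those quantities concrete, here is a rough sketch of how they could be computed from per-topic word counts. The helper names and the toy counts are invented for illustration, and using Jensen-Shannon divergence as the distance between topics is an assumption on my part (it is a common choice for maps like this one), not a description of what the visualisation tool does internally.

```python
import math
from collections import Counter


def topic_prevalence(topic_word):
    """Share of all tokens assigned to each topic (what a circle's area encodes)."""
    totals = [sum(counts.values()) for counts in topic_word]
    grand_total = sum(totals)
    return [t / grand_total for t in totals]


def term_bars(topic_word, topic, top_n=10):
    """Return (term, count in topic, count in corpus) for a topic's top terms.

    The second value corresponds to the red bar, the third to the blue bar.
    """
    overall = Counter()
    for counts in topic_word:
        overall.update(counts)
    return [
        (term, count, overall[term])
        for term, count in topic_word[topic].most_common(top_n)
    ]


def js_divergence(p_counts, q_counts):
    """Jensen-Shannon divergence (base 2) between two topics' word distributions.

    0 means the topics use words identically; 1 means they share no words.
    """
    p_total = sum(p_counts.values())
    q_total = sum(q_counts.values())
    divergence = 0.0
    for word in set(p_counts) | set(q_counts):
        p = p_counts.get(word, 0) / p_total
        q = q_counts.get(word, 0) / q_total
        m = (p + q) / 2
        if p > 0:
            divergence += 0.5 * p * math.log2(p / m)
        if q > 0:
            divergence += 0.5 * q * math.log2(q / m)
    return divergence


# Toy word counts for two topics (invented numbers).
toy_topics = [
    Counter({"mafia": 4, "police": 3, "raid": 1}),
    Counter({"mafia": 2, "film": 3, "review": 3}),
]
print(topic_prevalence(toy_topics))   # [0.5, 0.5]
print(term_bars(toy_topics, 0))       # [('mafia', 4, 6), ('police', 3, 3), ('raid', 1, 1)]
print(js_divergence(toy_topics[0], toy_topics[1]))
```

Topics with a small divergence would sit close together on the map, which is one way to read the clusters: the circles that bunch up are the ones drawing on similar vocabulary.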
Below is the visualisation of The Guardian data. Click here to see the data from the National Library of Australia.

by: email@example.com