The project documentation is right here (have a quick look and maybe come back and read more…).
I’ve been attending #HACK4NO – a cultural heritage open data event and hackathon in Oslo, Norway. I’ve been part of the corresponding #HACK4DK event group, running the same kind of events in Denmark for the past two years. Have a look at kulturognaturreise.no and hack4dk.wordpress.com for more info on the events.
Since I’d been busy doing practical stuff and running the events in Denmark, I hadn’t really had the time to get into it and actually participate in projects and prototype development. In Norway, once I’d finished my presentation on the #HACK4DK events, I was free to dive into the actual hacking of Norwegian cultural data.
During the initial dataset speed-dating session my eye had caught the open API of Det Store Norske Leksikon (SNL), the great national encyclopedia of Norway. The encyclopedia has a staff of roughly 5-10 people and a rather big network of writers. Certain knowledgeable people are tasked and paid to write on specific subjects, thus building the corpus of content. The editors decide which articles should be produced. It is a rather different model from the one known from Wikipedia.
So I had a great urge to try and pit those two resources against each other, and see how and where they differ and share similarities. Long story short, I was lucky to get 3 great guys on board with the idea, and this blog post is about the prototype we made, called “SNL +/- Wikipedia”.
We ended up building a prototype that compares and contrasts articles from Store Norske Leksikon with articles from Wikipedia. By drawing out the top 10 keywords from each article based on relative frequency, and comparing the use of those keywords in the corresponding articles from SNL and Wikipedia, the project helps give insight into the weight of certain keywords used to describe the topics of the articles.
The prototype allows the user to input titles of articles from SNL and Wikipedia and presents a visual comparison of keywords in the articles based on relative frequency of occurrence.
What you see above is a visualization of data from SNL and Wikipedia, distinguished by color. The size of each bubble represents the combined frequency of a word used in the two articles on “terror”, and the different colors within a single bubble show the relative frequency of occurrence in each article respectively.
In the above example we can see that Store Norske Leksikon and Wikipedia differ quite a lot in which words they use most to capture the essence of the concept “terror”. While SNL uses the words fear (frykt) and horror (redsel), Wikipedia does not use those words much. Conversely, Wikipedia contains cause (sak), achieve (oppnå), political (politisk) and violence (vold), while SNL doesn’t use these words at all.
It should be noted that if a certain word shows up in the top 10 list of words from one article, we also draw out this word (if it is there) from the other article, even if it is not part of that article’s own top 10 list.
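To make the comparison logic a bit more concrete, here is a minimal Python sketch of the idea described above. It is not the code from the prototype: the function names, the simple regex tokenization and the optional stop-word set are all assumptions made for illustration.

```python
import re
from collections import Counter


def relative_frequencies(text, stopwords=frozenset()):
    """Map each word to its share of all counted words in the text.

    Tokenization (lowercased runs of word characters) and the optional
    stop-word filter are illustrative assumptions, not the prototype's method.
    """
    words = [w for w in re.findall(r"\w+", text.lower()) if w not in stopwords]
    total = len(words)
    if total == 0:
        return {}
    return {word: count / total for word, count in Counter(words).items()}


def top_keywords(freqs, n=10):
    """Return the n words with the highest relative frequency (ties by alphabet)."""
    return sorted(freqs, key=lambda w: (-freqs[w], w))[:n]


def compare_articles(text_a, text_b, n=10, stopwords=frozenset()):
    """Compare two articles on the union of their top-n keywords.

    A word that is in the top n of only one article is still looked up in
    the other article; if it does not occur there, its frequency is 0.0.
    """
    freqs_a = relative_frequencies(text_a, stopwords)
    freqs_b = relative_frequencies(text_b, stopwords)
    keywords = set(top_keywords(freqs_a, n)) | set(top_keywords(freqs_b, n))
    return {w: (freqs_a.get(w, 0.0), freqs_b.get(w, 0.0)) for w in sorted(keywords)}


# Toy example with made-up text, using n=1 to keep the output small.
snl_text = "frykt frykt redsel vold"
wiki_text = "vold vold politisk sak"
print(compare_articles(snl_text, wiki_text, n=1))
```

In this toy example “vold” is not the top word of the first text, but it is still reported for both texts because it is the top word of the second one. Each pair of values could then feed a bubble: the sum for its size and the two parts for the color split.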
I’m quite proud of the project and very happy to have worked with Dexter, Jules and Odin in Oslo! Here we are: