Last year I became interested in doing some research using Wikipedia. The idea that interested me most was that the edit history, and possibly the discussions about particular topics, could provide a dictionary of words that are controversial with regard to those topics. The motivation was that it’s hard to extract sentiment or point of view from e.g. newspaper articles and blog postings; it requires a lot of human input and even then it’s hard to get good results. (I’ve done a little of this in looking at partisan language in newspaper articles.) But Wikipedia might be able to help us.
Controversial words and phrases get added and deleted frequently (think “Hussein” in the article on Barack Obama or “illegals” on an article about immigration). If we could extract a list of controversial words for a set of subjects, we could then look at pieces of writing devoted to those subjects and gauge to what extent those controversial words are employed.
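To make the idea concrete, here is a rough sketch of how one might score words by how often they get added and then deleted across an article's revisions. It is not something I actually built: the function names, the tokenizer, and the scoring rule (the smaller of a word's add count and delete count) are all assumptions for illustration, and it takes the revision texts as plain strings rather than fetching them from Wikipedia.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def add_remove_counts(revisions):
    """Count, per word, how many edits introduced it and how many dropped it.

    `revisions` is a list of revision texts in chronological order.
    A word counts as added when it is absent from one revision and
    present in the next, and as removed in the opposite case.
    """
    added, removed = Counter(), Counter()
    for old, new in zip(revisions, revisions[1:]):
        old_words = set(tokenize(old))
        new_words = set(tokenize(new))
        added.update(new_words - old_words)
        removed.update(old_words - new_words)
    return added, removed


def contested_words(revisions, min_cycles=1):
    """Rank words that were both added and removed at least `min_cycles` times.

    The score is min(adds, removes), a hypothetical measure of how many
    times a word went back and forth. Ties are broken alphabetically.
    """
    added, removed = add_remove_counts(revisions)
    scores = {w: min(added[w], removed[w]) for w in added if w in removed}
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(w, s) for w, s in ranked if s >= min_cycles]


if __name__ == "__main__":
    # Made-up revision texts, not real Wikipedia history.
    sample = [
        "Barack Obama is a politician",
        "Barack Hussein Obama is a politician",
        "Barack Obama is a politician",
        "Barack Hussein Obama is an American politician",
        "Barack Obama is an American politician",
    ]
    print(contested_words(sample))
```

Requiring both an add and a delete filters out words that simply arrive once as the article grows, which is most of them. A real version would have to deal with vandalism, wholesale reverts, and phrases rather than single words.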
I think it’s an intriguing idea but I didn’t get very far working on it. I still think you could get a pretty good list of controversial words for a given subject (as long as there is a lot of editing on that subject), but I am not convinced that there would be all that much interest in using this list to assess how controversial particular pieces of writing are. The word lists might be interesting, particularly if they had a time component (e.g. relevant/controversial words about Barack Obama over the past two years), but it seemed unlikely to help me finish my PhD and publish articles vaguely related to political science.
Anyway, I spent some time today looking back at research about Wikipedia that has accumulated in the past year. There has been a lot: popular interest in Wikipedia was really picking up in 2005 and 2006, and that means papers were published in 2007 and 2008. The growth, as well as a list of papers, is shown here.
There seem to be three kinds of papers on that list:
- Studies that analyze the behavior of Wikipedians, either individually or in aggregate, e.g. who writes Wikipedia, what are the rules, what are the rewards Wikipedians seek. Many articles use edit histories to do this analysis; others do interviews.
- Studies that look at the use of Wikipedia. Not many of these, but they look at using Wikipedia in the classroom, or which articles are popular, etc.
- Studies that use Wikipedia for natural language processing (NLP). This seems to be by far the biggest category in recent years, with people parsing Wikipedia to extract synonyms, learn about grammar, define categories, etc. Looks like a growth industry in linguistics and computer science.
In looking at the NLP articles I didn’t see much in the way of application of the insights gleaned from NLP (other than e.g. thesauruses), but I didn’t look that hard.
I got a few leads about APIs that should make it easier to do further NLP-type work with Wikipedia, and I’ll pass this along to a friend who is looking at using Wikipedia to define the sampling frame for a research project. And I’ll think some more about this “controversial words” project.