The Woodshed


Word Frequencies in Ruby

I just started working with Kevin Quinn on a large project focused on applying techniques from unsupervised learning to analyze the content of political speech. Here is a link to the first article to come out of this project, which analyzes the Congressional Record to assign floor speeches to categories based on the vector of word frequencies of each speech. Charmingly, the categories that result correspond to the categories we might create if we grouped the speeches ourselves (defense spending, education, etc.).

I read that paper last summer and was very impressed and inspired by it. Now I’m lucky to be working on the project myself. (I really do mean lucky: I did very little to deserve this, and in fact seemed to do my best to get passed over by not contacting Kevin in the fall, even after he asked me to get in touch.)

My first task was to produce a matrix of word frequencies for a set of New York Times articles Kevin provided me: rows are words (or word stems) and columns are articles, so a representative entry f_{w,a} is the number of times word w appears in article a. I did my best to code this up in good OOP style. For me, this basically meant thinking a little about what conceptual objects I was dealing with (Articles and Corpuses was what came to mind) and then looking for ways to wrap any “top-level” code that was left into these or other classes. The core of my solution is an Article class, each instance of which has a title and text, and knows how to produce a hash of its word frequencies, like

{"and" => 4, "mother" => 2, ...}

I also have a Corpus class, each instance of which has an array of Article objects. A Corpus knows how to produce a matrix of word frequencies for its Articles. Finally, we have a DirectoryOfTexts object, each instance of which has a directory location with texts in it, and which knows how to make an array of Article objects out of those texts that can be transformed into a Corpus. So the pseudocode is basically:

d = DirectoryOfTexts.new("path to directory with new york times articles in it")
c = Corpus.new(d.make_array_of_article_objects)
c.get_word_count_csv

And that produces the CSV of word frequencies for the articles in that directory.
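
To make that concrete, here is a rough sketch of what those three classes could look like. The class names and the two method names come from the pseudocode above; everything else (the word_frequencies method, the tokenizing regex, the CSV layout, the assumption that the texts live in .txt files) is just illustrative, not the exact code I wrote.

require 'csv'

class Article
  attr_reader :title, :text

  def initialize(title, text)
    @title = title
    @text  = text
  end

  # Hash of word => count; Hash.new(0) means a missing word comes back as 0
  def word_frequencies
    counts = Hash.new(0)
    @text.downcase.scan(/[a-z']+/) { |word| counts[word] += 1 }
    counts
  end
end

class Corpus
  def initialize(articles)
    @articles = articles
  end

  # Writes one row per unique word, one column per article
  def get_word_count_csv(path = "word_counts.csv")
    freqs = @articles.map(&:word_frequencies)
    words = freqs.flat_map(&:keys).uniq.sort
    CSV.open(path, "w") do |csv|
      csv << ["word"] + @articles.map(&:title)
      words.each do |word|
        csv << [word] + freqs.map { |f| f[word] }
      end
    end
  end
end

class DirectoryOfTexts
  def initialize(dir)
    @dir = dir
  end

  # One Article per .txt file, titled by its filename
  def make_array_of_article_objects
    Dir.glob(File.join(@dir, "*.txt")).map do |file|
      Article.new(File.basename(file, ".txt"), File.read(file))
    end
  end
end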

One little trick that I employed was to set a default value for my word frequency hashes, such that

h = Hash.new(0)

will return a value of 0 when I give it a key that it doesn’t have. This was useful in producing the matrix of word frequencies, because I could basically produce an array of unique words from all the word frequency hashes, and for each element of the matrix, just ask for

this_hash["this_word"]

and get a zero instead of a nil in cases where the word was not in that article.
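
As a quick illustration of that default value in action:

counts = Hash.new(0)                      # 0 comes back for any missing key
"the cat and the hat".split.each { |w| counts[w] += 1 }

counts["the"]    # => 2
counts["zebra"]  # => 0, rather than nil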

I found it very satisfying to think this through and work up the code, and I had a couple of thoughts:

1. I want to read more code that has good OOP style, and maybe read something on the topic. I feel like I’ve stumbled into some good practices but could speed things along by reading more good code and good theory. I should try to read some good code every day.

2. I wanted to explain to non-programmers what was so cool about this style of getting things done. I wondered again whether there are any good books or essays out there about the type of thinking that programming requires of you, or something that explains the zen of programming to a popular audience. It’s something I’ve thought about since taking lab electronics in college and even more since doing and teaching programming since come to grad school.