Word Frequencies in Ruby

I just started working with Kevin Quinn on a large project focused on applying techniques from unsupervised learning to analyze the content of political speech. Here is a link to the first article to come out of this project, which analyzes the Congressional Record to assign floor speeches to categories based on each speech's vector of word frequencies. Charmingly, the resulting categories correspond to the ones we might create ourselves if we grouped the speeches by hand (defense spending, education, etc.).

I read that paper last summer and was very impressed and inspired by it. Now I’m lucky to be working on the project myself. (I really mean lucky — I did very little to deserve this, and in fact appeared to do my best to be passed over by not contacting Kevin in the fall even after he asked me to get in touch.)

My first task was to produce a matrix of word frequencies for a set of New York Times articles Kevin provided me: rows are words (or word stems) and columns are articles, so the representative entry f_{w,a} is the number of times word w appeared in article a. I did my best to code this up in good OOP style. For me, this basically meant thinking a little about what conceptual objects I was dealing with (Articles and Corpuses was what came to mind) and then looking for ways to wrap any “top-level” code that was left into these or other classes. The core of my solution is an Article class, each instance of which has a title and text, and which knows how to produce a hash of its word frequencies, like

{"and" => 4, "mother" => 2, ...}

I also have a Corpus class, each instance of which has an array of Article objects. A Corpus knows how to produce a matrix of word frequencies for its Articles. Finally, we have a DirectoryOfTexts object, each instance of which has a directory location with texts in it, and which knows how to make an array of Article objects out of those texts that can be transformed into a Corpus. So the pseudocode is basically:

d = DirectoryOfTexts.new("path to directory with new york times articles in it")
c = Corpus.new(d.make_array_of_article_objects)

And that produces the csv of word frequencies for the articles in this directory.
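Since the classes themselves didn’t make it into this post, here is a minimal sketch of the design described above. The method names beyond the ones mentioned are my own guesses at the design, not the project’s actual code:

```ruby
# Sketch of the Article/Corpus design described above (names hypothetical).
class Article
  attr_reader :title, :text

  def initialize(title, text)
    @title = title
    @text  = text
  end

  # Hash of word => count, e.g. {"and" => 4, "mother" => 2}
  def word_frequencies
    freqs = Hash.new(0)
    @text.downcase.scan(/[a-z']+/) { |word| freqs[word] += 1 }
    freqs
  end
end

class Corpus
  def initialize(articles)
    @articles = articles
  end

  # One row per unique word (sorted), one count column per article.
  def frequency_matrix
    freqs = @articles.map(&:word_frequencies)
    words = freqs.flat_map(&:keys).uniq.sort
    words.map { |w| [w] + freqs.map { |f| f[w] } }
  end
end
```

With a DirectoryOfTexts object doing the file reading, the two-line pseudocode above then works as written.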

One little trick that I employed was to set a default value for my word frequency hashes, such that

h = Hash.new(0)

will return a value of 0 when I give it a key that it doesn’t have. This was useful in producing the matrix of word frequencies, because I could basically produce an array of unique words from all the word frequency hashes, and for each element of the matrix, just ask for

h[word]

and get a zero instead of an error in cases where the word was not in that article.
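In miniature, with a made-up five-word text, the trick looks like this:

```ruby
# A hash with a default value of 0: missing keys return 0 instead of nil.
counts = Hash.new(0)
"the cat and the hat".split.each { |w| counts[w] += 1 }

counts["the"]    # => 2
counts["zebra"]  # => 0, not nil and not an error
```

The default value also makes the counting loop itself shorter, since `counts[w] += 1` works on the first occurrence of a word without any “is this key here yet?” check.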

I found it very satisfying to think this through and work up the code, and I had a couple of thoughts:

1. I want to read more code that has good OOP style, and maybe read something on the topic. I feel like I’ve stumbled into some good practices but could speed things along by reading more good code and good theory. I should try to read some good code every day.

2. I wanted to explain to non-programmers what was so cool about this style of getting things done. I wondered again whether there are any good books or essays out there about the type of thinking that programming requires of you, or something that explains the zen of programming to a popular audience. It’s something I’ve thought about since taking lab electronics in college, and even more since I started doing and teaching programming in grad school.

Insanity with LaTeX and screenshots

I am revising MyProxyAdvisor’s first grant proposal, and wanted to make the screenshots that appear in the document look better. I just spent between 2 and 3 hours on this, which is awful enough, but it was also one of those maddening processes where I’m not sure I ever figured out exactly what was wrong, and I can’t spend any more time on it. But I thought I would at least post here for my own satisfaction and possibly for the advancement of all digitized mankind. (At least the part messing around with arcane document processing software.)

The basic problem was that pdflatex was enlarging the image files I included in my LyX document. The image that appeared in the pdf created by LyX was much larger than the original image file I included. Not only was the image grainy and the text clownishly large, but the image spilled over the right edge of the page. I figured the problem was with LyX, pdflatex, or my screenshots. I tried various ways of taking the screenshot and various formats to save it in, and the same problem happened no matter what I did. I even downloaded a trial version of Better Screenshots, which is a pretty nice piece of software (possibly worth the $20 they’ll want from me in a month), and the same problem occurred. Compiling the file directly within WinEdt produced the same problem, so it didn’t seem to be a LyX-specific problem at the compiling stage (although this seemed unlikely anyway). Finally I decided to mess around with the options within LyX at the stage of including the file. I just sort of randomly decided to specify the width as “4.5in” (really, you specify “4.5” in the first form and select “in” from the second), selected “Maintain aspect ratio,” and voilà, it worked perfectly. When you look at the LaTeX code it says


\caption{Screen shot of our application (development version)\label{mpaScreen}}



So this was probably a very obvious LaTeX solution, but I just could not understand why this problem would be happening in the first place: why pdflatex would change the size of the image during the compile process. My best guess is that pdflatex sizes an image from its pixel dimensions and the resolution recorded in the file (falling back to a default of 72 dpi when none is recorded), so a screenshot saved with missing or odd resolution metadata comes out physically huge. That would also explain why the problem hit these particular image files and not others I tried.
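For the record, the figure code LyX emits with those settings should look roughly like this; the filename here is a hypothetical stand-in, since it didn’t make it into the post:

```latex
\begin{figure}
  \centering
  % filename is hypothetical; width/keepaspectratio match the LyX settings above
  \includegraphics[width=4.5in,keepaspectratio]{screenshot.png}
  \caption{Screen shot of our application (development version)\label{mpaScreen}}
\end{figure}
```

Setting an explicit width overrides whatever physical size pdflatex would otherwise infer from the image file, which is presumably why it fixed the problem.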


I talked this morning with Nicco about producing diagrams of MyProxyAdvisor’s functionality (the flow of the user experience, navigation, etc). He said they develop their charts on a Mac program that ships with their operating system, and asked if I had Visio for Windows, which he understood was an alternative. Of course I don’t, but I was able to track down Dia, which is an open-source program designed as an alternative to Visio and available for Windows. I installed this and so far so good. In my 5 minutes of playing around with it I did not produce anything I couldn’t have done with OpenOffice draw or even Impress (their Powerpoint alternative) but probably when I get back to this I will find out more about its advantages.

Hard-earned 1890 tax dollars went to this...

I am doing a project involving data from the 1890 agricultural census. One of the variables I won’t be using:

Type: numeric. Number of lambs killed by dogs, 1889.
Source: U.S. Bureau of the Census [1895b], Table 8.

This data is provided by county. In my home county of Monroe, New York State, there were apparently 402 lambs killed by dogs in 1889.

In Humboldt County, CA, where Eureka is, there were over 8,000 sheep killed by dogs that year.

I can’t believe they were collecting this kind of thing.

Is there some badass research question hiding in this data about sheep and dogs? Any aspiring Steven Levitts are invited to chime in.

Browsing without a mouse

My roommate John had mentioned some Linux feature he had found that allowed for mouseless browsing: when you pressed a key, each of the links on a webpage would appear with a letter next to it, and you could follow a link by pressing that letter. No using the mouse. This morning I found a few Firefox extensions that allow this kind of browsing, and although I’ve only been using Hit-a-Hint for about five minutes I’m already hooked. The other one, the aptly named Mouseless Browsing, looks good too — it looks to me like it has more settings you can tweak. But by the time I had found MB I had already installed HaH, and I like the default behavior of HaH so I’ll stick with it for now.

As part of my delayed but accelerating descent into geekdom, I’ve come to understand this aversion to the mouse. I think that people really get nuts about keyboard shortcuts through some combination of a) using the computer enough for it to be important, and b) getting comfortable enough with their software to want to understand non-necessary but useful things like keyboard shortcuts. I’m increasingly there on both counts. Plus once you start trying to not use the mouse, it gets to be kind of an obsession. The less frequently you reach for the mouse the more you wonder whether there is a keyboard shortcut for that too.

In reading today’s Lifehacker, where Wendy had asked people to mention their favorite shortcuts, I was reminded of that Onion opinion piece where the guy was going on and on about the usefulness of keyboard shortcuts. I would never do that, would I?

Deep into Python

The past week or so I’ve been diving deep into Python. I’m trying to learn text processing techniques so that I can assemble data through screen scraping, both for research and for my voting recommendation project. For example, the first thing I’d really like to be able to do is parse SEC documents on the web in order to assemble a database of mutual fund proxy voting. There is a lot of data out there, and I’m tired of giving up if I can’t find it in a nice table somewhere (or assembling it in annoying ways like manual data entry). Python is going to help me fix this.

After some deliberation about whether to learn Ruby or Python, I settled on Python, largely because I heard the libraries were somewhat better developed and, more importantly, my computer scientist roommate appears to be a Python wizard and seems to like helping me along. So I started with the O’Reilly book Learning Python, changed over to Magnus Lie Hetland’s Beginning Python, and am now starting in on David Mertz’s Text Processing in Python. I can report that the Hetland book is the best at getting off the ground (much more engaging than the O’Reilly book), but I found that Hetland’s examples and problems were a little too involved, in ways that seemed a little too obscure, for my speed. The “Regular Expressions HOWTO” was a great way to solidify my understanding of regular expressions, and the Mertz book looks like a good way to extend things a little further. I’ll try to report back on my progress.

Overall, I can say that I am really enjoying this. I am getting that euphoric feeling that comes with rapid progress at the beginning of pretty much anything (for me, particularly languages and musical instruments). But also it’s clear that this stuff is really useful for what I want to do, and opens up a lot of possibilities. I love it that you can write a few lines of code and extract email addresses from some webpage somewhere. Not that I’m ready to become a spammer or anything, but I can see how this is bringing me closer to being able to assemble information for people in a useful way, which is pretty much the goal.

Be ____ to strangers

Last week I went to DC to meet some people about my voting recommendations project. Besides the outrageous heat, the thing I noticed going back there is how much nicer people are down there. I guess I’m not really talking about the power nomads (is that a term?) who come to DC from elsewhere, but rather the people in service jobs you come across at restaurants and offices and airports. In just my little two-day trip I had more unnecessary interaction with strangers than I can remember having in a month here in Cambridge.

Then on Sunday I was walking (hobbling, actually) home from the doctor’s office (soccer injury) when I came across a marketing event for the Toyota Yaris in the parking lot on my street. This marketing group from LA was doing promotion in a bunch of East Coast cities where they give out a gift certificate if you do a test drive and fill out a survey. I got into this conversation with one of the marketing people — a really laid-back LA guy who seemed a little out of place in marketing — about why Boston and LA are laid out differently (he pushed the corporate conspiracy theory, I emphasized the timing of their development) and this older lady, very Cambridge-looking somehow, walked by and asked what was going on. He told her about the promotion, to which she replied that it was because of marketing like this that people don’t know how to vote and we have such a screwed up government. She then started walking away. Somehow I thought this kind of captured both the charm and curse of the Cambridge style — judgmental and private. The laid-back marketing guy said something to her about how he had reservations about what he was doing, and I jumped in and told the lady that these were the most fuel-efficient non-hybrid cars out there, which the laid-back marketing guy had just told me. This seemed to make her a little less hostile.

Anyway, somehow this clash of Cambridge communist with laid-back LA marketing guy kept this whole issue of regional personalities in high relief for me. I think it’s interesting how persistent these are, although not surprising given how strong the pressures are to conform to social conventions like how you treat strangers.

Kafka on the books we need

I was struck by this quote from Kafka, which I heard on Garrison Keillor’s Writer’s Almanac for July 3, 2006:

We need the books that affect us like disaster, that grieve us deeply, like the death of someone we loved more than ourselves, like being banished into forests far from everyone, like a suicide. A book must be the axe for the frozen sea inside us.

I have felt something like that with only a few novels — in the past couple of years, Blindness, Sophie’s Choice, Lolita, Executioner’s Song (not really a novel). I spend too much time reading boring nonfiction, which probably adds layers to the frozen sea. I want to read some more good fiction.

Not so deluded after all?

It’s a figure that just won’t go away: 19% of Americans believe they are among the top 1% of income earners, and another 20% expect to be there someday. Since David Brooks first deployed this factoid in a November 2002 article in the Atlantic Monthly, I have heard people refer to the figure on blogs, at academic talks, on radio shows, and in casual conversation. Not only does it speak to a basic American (or human) predilection for self-aggrandizement, but it also suggests a simple reason why Americans are so resistant to income redistribution: if everyone thinks he’s rich, no one will respond to calls to soak the rich.

The problem with the figure is that it’s completely wrong. More precisely, the numbers are right, but the interpretation is way off. Here is what David Brooks wrote in his piece, “Superiority Complex”:

In America . . ., we can all be celebrities in some little sphere, and we are very impressed with ourselves. During the most recent presidential election a Time magazine-CNN poll asked voters whether they were in the top one percent of income earners. Nineteen percent reported that they were, and another 20 percent said that they expected to be there one day.

But the poll didn’t actually ask people whether they were in the top one percent of income earners. Here’s the actual survey question:

As you may know, Al Gore has claimed that George W. Bush’s proposed tax cut will largely benefit those with high incomes, who he claims are the top 1%. Thinking about your own situation, do you think that you are in the top group that would benefit from Bush’s proposed tax cut now, do you think you will be in this group that will benefit in the future, or do you think you will not benefit from Bush’s tax cut?

19% Will benefit right away
20% Will benefit in the future
55% Will not benefit
6% Not sure

(Survey by Time, Cable News Network. Methodology: Conducted by Yankelovich Partners, October 25-October 26, 2000 and based on telephone interviews with a national adult sample of 2,060.)

The question is simply an awful way to figure out how people assess their incomes relative to other Americans. Instead, it addresses two highly politicized questions about Bush’s tax plan. First, would Bush’s plan lower taxes only for the top 1% of income earners? As the question makes clear, this was Gore’s claim, denied by Bush. The question therefore asks, “Which candidate do you trust?” Second, does a tax cut for high income earners have indirect benefits for people of lower incomes? Republicans have tended to argue that high marginal tax rates discourage hiring and innovation, ultimately harming the entire economy, while Democrats have been suspicious of what they call “trickle-down economics.” On this point, the question posed is, “Which party’s economic philosophy do you support?”

Now, if you asked these underlying questions directly, restricted your attention to the subset of respondents who thought that the plan would directly affect only the top 1% and have no indirect benefits for others, and looked at the responses given by this subset to the question actually posed, you could learn something about this group’s assessment of its position in the income distribution. But of course this group would not be representative, since the underlying questions are politically charged and we all know that political beliefs correlate with economic position. So my conclusion is that we can’t learn anything from this survey question about how people assess their current and future income relative to that of their peers.

In fact, Americans don’t seem to delude themselves up the economic ladder at all. The figure shows how 2,800 Americans responded to a question in the 1998 General Social Survey asking, “Compared with American families in general, would you say your family income is far below average, below average, average, above average, or far above average?” Almost half of respondents said their family’s income was average. The remaining respondents were more likely to put themselves below average than above average. Fewer than 3% said they were “far above average.” This is pretty much how sober realists would sort themselves. This suggests to me that almost all of the 19% of people who said that they would benefit from Bush’s tax cut chose that response because they disagreed with Gore’s claim, not because they put themselves in the privileged 1%. (GSS respondents were economically representative of American families. The median respondent put his family’s income in the $35,000 to $39,000 bracket; based on US Census Bureau figures, the median income of all households in 1998 was $38,885.)

Let’s do away with this pesky “fact.” Not only because it’s fiction, but also because it misrepresents the economic experience of Americans and obscures the political context for redistribution in this country.