I made a fun toy web app today while learning some new Rails stuff. It’s a game where you try to come up with phrases that are long but also return a lot of search results on Google. The inspiration comes from playing around with Google myself, either looking at how often common set phrases appear in Google Scholar (e.g. “the rest of the paper is organized as follows”) or just looking at where unusual phrases pop up in Google (e.g. “I thought it was my girlfriend”).
I played around with different ways of scoring the searches. I want to reward long phrases and popular phrases; we don’t want “people” to win, but we also don’t want a long piece of text that appears only once to win either. The scoring at this point is
(number of hits)*(number of words in phrase – 1)^3
This way you get no points for a single word search, and you seem to do pretty well if you come up with a common four word search like “the new york times.” I might tweak it a bit to further reward long phrases, but it works pretty well for now.
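Here is a minimal sketch of that scoring rule in Ruby. The method name phrase_score and the hit counts are made up for illustration; the real app gets its hit counts from Google, and that part is not shown here.

```ruby
# Score a phrase as hits * (number of words - 1)^3.
# `hits` is supplied by the caller; fetching it from a search engine is out of scope.
def phrase_score(phrase, hits)
  word_count = phrase.split.size
  # Clamp at zero so an empty phrase scores 0 rather than going negative.
  extra_words = [word_count - 1, 0].max
  hits * extra_words**3
end

# Hypothetical hit counts, just to show the shape of the scores.
phrase_score("people", 1_000_000)        # => 0, single words earn nothing
phrase_score("the new york times", 100)  # => 2700, i.e. 100 * 3^3
```

Because the word count is cubed, each extra word matters a lot: a four word phrase gets a multiplier of 27, a five word phrase gets 64.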
There are a lot of things I would do to make this ready for prime-time, but my girlfriend and I can confirm that it’s pretty fun to play already. Below is a screenshot of the top 20 highest scoring phrases in about an hour of playing.
I really think this could be fun — I just need to spend a day or so on tweaking some things and get some front-end help from a couple of friends. Also finish my dissertation.
Last year I became interested in doing some research using Wikipedia. The idea that was most interesting to me was that the edit history, and possibly the discussions about particular topics, could provide a dictionary of words that are controversial with regard to particular topics. The motivation was that it’s hard to extract sentiment or point of view from e.g. newspaper articles and blog postings; it requires a lot of human input and even then it’s hard to get good results. (I’ve done a little of this in looking at partisan language in newspaper articles.) But Wikipedia might be able to help us.
Controversial words and phrases get added and deleted frequently (think “Hussein” in the article on Barack Obama or “illegals” on an article about immigration). If we could extract a list of controversial words for a set of subjects, we could then look at pieces of writing devoted to those subjects and gauge to what extent those controversial words are employed.
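To make the idea concrete, here is a rough sketch in Ruby of one way to measure how often a word gets added or deleted. It is only an illustration, not something I actually built: the method name word_churn is hypothetical, it assumes you already have the text of each revision as a string in chronological order, and the sample revisions are invented.

```ruby
require "set"

# Count, for each word, how many times it appears or disappears
# between consecutive revisions of an article.
def word_churn(revisions)
  churn = Hash.new(0)
  revisions.each_cons(2) do |before, after|
    old_words = before.downcase.scan(/[a-z']+/).to_set
    new_words = after.downcase.scan(/[a-z']+/).to_set
    # Words present in exactly one of the two revisions were added or deleted.
    changed = (old_words - new_words) | (new_words - old_words)
    changed.each { |word| churn[word] += 1 }
  end
  churn
end

# Invented revisions, for illustration only.
revisions = [
  "Barack Obama is a senator",
  "Barack Hussein Obama is a senator",
  "Barack Obama is a senator",
  "Barack Hussein Obama is a senator"
]
word_churn(revisions)  # => {"hussein"=>3}
```

Words with high churn relative to the number of revisions would be the candidates for the controversial word list. A real version would also have to deal with vandalism, reverts, and words that churn only because whole paragraphs get moved around.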
I think it’s an intriguing idea but I didn’t get very far working on it. I still think you could get a pretty good list of controversial words for a given subject (as long as there is a lot of editing on that subject), but I am not convinced that there would be all that much interest in using this list to assess how controversial particular pieces of writing are. The word lists might be interesting, particularly if they had a time component (e.g. relevant/controversial words about Barack Obama over the past two years), but it seemed unlikely to help me finish my PhD and publish articles vaguely related to political science.
Anyway, I spent some time today looking back at research about Wikipedia that has accumulated in the past year. There has been a lot: popular interest in Wikipedia was really picking up in 2005 and 2006, which means papers were published in 2007 and 2008. The growth, as well as a list of papers, is shown here. There seem to be three kinds of papers on that list:
- Studies that analyze the behavior of Wikipedians, either individually or in aggregate, e.g. who writes Wikipedia, what are the rules, what are the rewards Wikipedians seek. Many articles use edit histories to do this analysis; others do interviews.
- Studies that look at the use of Wikipedia. Not many of these, but they look at using Wikipedia in the classroom, or which articles are popular, etc.
- Studies that use Wikipedia for natural language processing (NLP). This seems to be by far the biggest category in recent years, with people parsing Wikipedia to extract synonyms, learn about grammar, define categories, etc. Looks like a growth industry in linguistics and computer science.
In looking at the NLP articles I didn’t see much in the way of application of the insights gleaned from NLP (other than e.g. thesauruses), but I didn’t look that hard.
I got a few leads about APIs that should make it easier to do further NLP-type work with Wikipedia, and I’ll pass this along to a friend who is looking at using Wikipedia to define the sampling frame for a research project. And I’ll think some more about this “controversial words” project.
In getting ready to load a bunch of new trades into my database of stock trades by members of Congress, I was troubled to see that ActiveRecord was no longer working the way it was supposed to. For one of my models, every call to save returned false, which I noticed only when I realized that none of the data I was trying to put in was being saved. (This was a reminder that I should use save! for these data collection purposes.) At first I thought maybe the database was locked for some reason, but then I realized that saving worked on other models/tables, so that couldn’t be it. It was acting like a validation was failing, but I don’t have any validations for this model. I cursed and googled for a solution in vain.
Finally I looked into the ActiveRecord source code, which seemed to be going nowhere until I saw that a key part of the validation process is a method called valid?. Then I remembered that I had written my own valid? method for this model (it checks whether there is a company and a date for the trade), and I realized this must be it: none of the trades I am trying to put in have a company (that association comes later), so my valid? returns false and Rails treats them as failing validation.
So: I will try not to stomp on the ActiveRecord::Base code again. It’s a tricky bug to fix when you do.
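For anyone who wants to see the mechanism, here is a stripped-down sketch in plain Ruby. FakeRecord is a stand-in I made up so the example runs without Rails; it is not the ActiveRecord source, it only mimics the one relevant behavior (save calls valid? and gives up if it returns false). The class names Trade and SafeTrade and the method name complete? are also just for illustration.

```ruby
# Made-up stand-in for ActiveRecord::Base: save consults valid? first.
class FakeRecord
  def initialize
    @persisted = false
  end

  def valid?
    true # no validations defined, so every record passes
  end

  def save
    return false unless valid?
    @persisted = true
    true
  end

  # Like save, but fails loudly instead of quietly returning false.
  def save!
    save || raise("record invalid")
  end

  def persisted?
    @persisted
  end
end

# The bug: a helper named valid? silently replaces the framework hook.
class Trade < FakeRecord
  attr_accessor :company, :date

  def valid?
    !company.nil? && !date.nil?
  end
end

# The fix: give the helper a name the framework does not use.
class SafeTrade < FakeRecord
  attr_accessor :company, :date

  def complete?
    !company.nil? && !date.nil?
  end
end

trade = Trade.new
trade.date = "2008-12-01"  # no company yet; that association comes later
trade.save                 # => false, and nothing is stored

safe = SafeTrade.new
safe.date = "2008-12-01"
safe.save                  # => true
```

Calling save! on the first trade raises instead of returning false, which is why it is the better choice for bulk data loading: the failure shows up on the first record rather than after the whole import.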
The NYT piece by David Johnston raises some interesting questions and provides some useful answers about legal issues in the Blagojevich case. The article focuses on the difficulty of defining the “difference between criminality and political deal making,” which is an area that has been interesting to me. As the article points out, everyone knows that politicians take official actions in return for campaign contributions. I’ve wondered whether Blagojevich’s only real offense was to make the quid pro quo explicit.
The analysis in the article confirms that in fact Blagojevich would be in the clear if he had managed to keep his wheeling and dealing out of the investigators’ earshot. It also suggests that while explicitly offering a seat to a candidate in return for campaign contributions is illegal, it would be “easier to prosecute” if the benefits he was to receive were personal favors like cash or a job.
The fundamental issue raised by the article is that so far we only know of Blagojevich telling his aides that he wanted something in return for the seat — nothing we know of has him on the phone with a candidate making an offer. It’s possible, the article suggests, that the juiciest stuff about him wanting something in return for the Senate seat will amount to “just talk” and any prosecution will have to go forward on other material.
All of this suggests that this is another case (like Microsoft, like Enron) where the main lesson for other people who might want to bend rules in the future is to be smarter about communication. Blagojevich could easily have carried out a sale of the office with impunity if he had done it more subtly. So: let the sale of government offices continue, but let’s just be sure to not do it in such a crass and open way.
I’ve seen some speculation that Jim Steinberg might be among Obama’s top choices for National Security Adviser. I met Jim a few times when I was working at Brookings and I think he was the most impressive person I encountered there. One time Mary Graham and I went in to talk to him about some work we were doing on disclosure policy (transparency vs. security-type stuff) and I was taken aback by how quickly he cut through the clutter (both of what we were saying and of the policy area as a whole) and seized the key issues. I don’t remember specifically what he suggested or anything; I just remember a feeling of awe. One of the reasons I am excited about the new administration is that people like Jim Steinberg will be making important decisions.
On Waxy.org I found an interview and screencast about bandcamp.mu, a new site that tries to make it as easy for bands to publish and share music as it is for bloggers to write blog posts. From experience I know that bands struggle with how to share their music and establish a web presence. MySpace really changed how bands operated on the web but, from the bandcamp perspective, MySpace gives the band too little flexibility over how their stuff will be distributed. I hope this site is a success.
Last week the keyboard and trackpad on my MacBook Pro stopped working (again!). I took it into the Apple store and had to leave it there (again!). So, over the weekend it was back to the ThinkPad. The transition was pretty easy because so much of my work is on svn and Google Docs. The best part was that I got to download and use Google Chrome, which is not yet available on the Mac. I think the overall performance was quite good, but there were a few little things I especially liked. So, my MacBook comes back with a replaced topcase (again!), and I’m happy to have it back, but now I miss a few things about Google Chrome as I go back to Firefox:
- The “type anything” address bar: I found it really useful to be able to type “besley coate scholar” and just key down to “Search Google Scholar for ‘besley coate’”
- The way it handles downloads is really nice — each document available as a little tab along the bottom of the window, with options for how to handle that file and others like it always available.
A few years ago I noticed people started saying “form factor” when they talked about the design of electronic devices. I found it annoyingly jargony and uneconomical, kind of like “price point”, which means price. Yesterday my friend Ryan used the term in reference to the shiny iPhone he was about to buy. On reflection I guess it’s not quite as bad as “price point.” “Form factor” refers to both the size and the shape of an object (and it appears to be used in a semi-technical way in electronic engineering), and I can’t seem to think of a single term that refers to both size and shape. Any ideas?
The intellectual case for transparency is easy to make, but it’s hard to demonstrate that transparency has the benefits its advocates (like me) claim. On Kevin Lewis’s List I came across a paper that appears to do a nice job of documenting the benefits of financial disclosure regulation to investors. The abstract is below. Their approach is to look at how the stock market reacted to events surrounding the passage of Sarbanes-Oxley. As SOX came closer to passage, firms that had been “managing” their earnings more closely saw a larger stock price boost than firms that had been doing less book-cooking. The authors appear to interpret this as evidence that SOX-induced financial disclosure benefited investors, since the firms more affected by it saw a larger boost in price.
It’s a result that may be counterintuitive to a lot of people: the BAD firms are the ones that benefited from the regulation. (More intuitive would be a story where the bad guys who had been getting away with something took a hit as it became clear that the free and easy days were over.) But the paper’s interpretation of stock price movements reflects the way financial economists tend to look at these things: investors had already discounted the price of these bad companies based on their questionable financial reporting, so signs that this would stop and that the real value of the firm would become clearer led investors to feel better about the investment and trade away the discount.
If this is how things work, why would firms obfuscate in the first place?
Also, how do we know that the news about SOX being passed was not interpreted by investors the opposite way, i.e., “Well, if it’s going to be this lax, then these bad guys are going to get off easy”?
I may come back to this one. Anyway, here is the abstract:
Market Reaction to Events Surrounding the Sarbanes-Oxley Act of 2002 and Earnings Management
Haidan Li, Morton Pincus & Sonja Olhoft Rego
Journal of Law and Economics, February 2008, Pages 111-134
The Sarbanes-Oxley Act (SOX) of 2002 is the most important legislation affecting corporate financial reporting enacted in the United States since the 1930s. Its purpose is to improve the accuracy and reliability of accounting information that is reported to investors. We examine stock price reactions to legislative events surrounding SOX and focus on whether such stock price effects are related cross-sectionally to the extent firms had managed their earnings. Our univariate results suggest that significantly positive abnormal stock returns are associated with SOX events, and our primary analyses reveal considerable evidence of a positive relationship between SOX event stock returns and the extent of earnings management. These results are consistent with investors anticipating that the more extensively firms had managed their earnings, the more SOX would constrain earnings management and enhance the quality of financial statement information.
I’m on a mailing list of Kevin Lewis, who contributes to “Surprising Insights from the Social Sciences” in the Globe, in which he groups together interesting-looking abstracts on a particular theme. He does this every day. I find a lot of interesting papers this way, and it’s a pretty inspiring way to start the work day to see that other people are doing cool things, and that people care about it.
I expect to post some abstracts here; I wrote about a paper from the list yesterday on the Social Science Stats blog.
He appears to not have a blog, but he should. A mailing list is a pretty clumsy way of distributing this information, and I think a lot of people would be interested in getting it.
You can sign up by emailing him at email@example.com.