For the past week or so I’ve been diving deep into Python. I’m trying to learn text processing techniques so that I can assemble data through screen scraping, both for research and for my voting recommendation project. For example, the first thing I’d really like to be able to do is parse SEC documents on the web in order to assemble a database of mutual fund proxy voting. There is a lot of data out there, and I’m tired of giving up when I can’t find it in a nice table somewhere (or of assembling it in annoying ways like manual data entry). Python is going to help me fix this.
After some deliberation about whether to learn Ruby or Python, I settled on Python, largely because I heard the libraries were somewhat better developed and, more importantly, my computer scientist roommate appears to be a Python wizard and seems to like helping me along. So I started with the O’Reilly book Learning Python, changed over to the Magnus Lie Hetland book Beginning Python, and am now starting on the David Mertz book Text Processing in Python. I can report that the Hetland book is the best at getting off the ground (much more engaging than the O’Reilly book), but I found that Hetland’s examples and problems were a little too involved, and a little too obscure, for my speed. The “Regular Expression HOWTO” was a great way to solidify my understanding of regular expressions, and the Mertz book looks like a good way to extend things a little further. I’ll try to report back on my progress.
Overall, I can say that I am really enjoying this. I am getting that euphoric feeling that comes with rapid progress at the beginning of pretty much anything (for me, particularly languages and musical instruments). But it’s also clear that this stuff is really useful for what I want to do, and that it opens up a lot of possibilities. I love that you can write a few lines of code and extract email addresses from some webpage somewhere. Not that I’m ready to become a spammer or anything, but I can see how this is bringing me closer to being able to assemble information for people in a useful way, which is pretty much the goal.
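To make the email example concrete, here is a minimal sketch of what those few lines might look like, using only the standard library. The names `EMAIL_RE`, `extract_emails`, and `emails_from_url` are made up for illustration, and the pattern is deliberately simple: it catches common-looking addresses, not everything the email standards allow.

```python
import re
from urllib.request import urlopen

# Simplified, assumed pattern: good enough for typical addresses,
# not a full implementation of the email address grammar.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails(text):
    """Return the email-like strings in text, deduplicated, in order of first appearance."""
    seen = []
    for match in EMAIL_RE.findall(text):
        if match not in seen:
            seen.append(match)
    return seen


def emails_from_url(url):
    """Download a page and run extract_emails on it (assumes the page decodes as UTF-8)."""
    with urlopen(url) as response:
        html = response.read().decode("utf-8", errors="replace")
    return extract_emails(html)


# Offline demonstration on a small, invented HTML snippet.
sample = '<a href="mailto:jane@example.com">Jane</a> or bob.smith@example.org, jane@example.com'
print(extract_emails(sample))
```

Keeping the regular expression work in `extract_emails` separate from the download step means the pattern can be tried out on any string without touching the network.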