Fecal Treacle
I got all excited when I came across this page:
http://www.missoulian.com/articles/2007/12/03/publicrecord/public2.txt
I'm all, "Hot! Dog!" 'cause what better way to keep up on the hometown gossip than the Public Records? Better still, if you look at that URL, you can see that it can be easily parsed into date components, thereby leading an interested party (me) to automate the shit! So a couple weekends ago, I spent the better part of four hours writing a screen scraper that would automagically fetch all the archives and parse out the names and jack them into a database. The grand idea was to have a searchable repository of names so I or anyone else could see whether that slutty cheerleader from high school is on her second or third DUI, or your chemistry teacher from 7th grade is a sex offender, and other sorts of schadenfreudian fun.
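For the curious, here's roughly what "parsed into date components" buys you. This is a minimal sketch, not the scraper I wrote: the URL template is inferred from that one example URL, and the weekly step is an assumption (I have no idea whether the "public2" part ever changes). The names `archive_url` and `weekly_urls` are made up for illustration.

```python
from datetime import date, timedelta

# Assumed URL layout, inferred from the single example URL in the post.
BASE = "http://www.missoulian.com/articles/{:%Y/%m/%d}/publicrecord/public2.txt"


def archive_url(day):
    """Build the (assumed) public-records URL for one date."""
    return BASE.format(day)


def weekly_urls(start, end):
    """Yield one URL per week from start through end, inclusive.

    Weekly publication is an assumption, not something the site documents.
    """
    day = start
    while day <= end:
        yield archive_url(day)
        day += timedelta(days=7)


if __name__ == "__main__":
    for url in weekly_urls(date(2007, 12, 3), date(2007, 12, 17)):
        print(url)
```

Fetching each URL is then a loop away, which is exactly the part that looks easy before you see the HTML.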
If you've ever done it, you'll know that screen-scraping can be a major exercise in agony, especially for newspaper sites. The basic idea is to build a tree out of the HTML that makes up the page, then grab the shit that you're interested in and do something with it. There are several free HTML parsing libraries that can make sense of even the most god-awful HTML, but it's still up to the user to figure out what goes where and be able to dope out some sort of rhyme or reason to it. But the Missoulian is one of the most fucked-up I've come across. Most newspapers' websites are like that because they're trying so desperately to cling to life by selling online ads, which usually take the form of really sloppy JavaScript and/or dynamically generated HTML. The inevitable result is a goddamn mess.
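To make "grab the shit that you're interested in" a little more concrete, here's a sketch using Python's standard-library `html.parser`. Note that it's event-based rather than a true tree builder, so it's a stand-in for the approach described above, not the same thing. It assumes each section starts with a heading in a `<b>` tag and that everything up to the next bold heading belongs to that section. The sample markup is invented and far tidier than the real page.

```python
from html.parser import HTMLParser


class SectionCollector(HTMLParser):
    """Group text chunks under the most recent bold heading."""

    def __init__(self):
        super().__init__()
        self.sections = {}
        self._in_bold = False
        self._current = None

    def handle_starttag(self, tag, attrs):
        if tag == "b":
            self._in_bold = True

    def handle_endtag(self, tag):
        if tag == "b":
            self._in_bold = False

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._in_bold:
            # Bold text is assumed to be a section heading.
            self._current = text
            self.sections.setdefault(text, [])
        elif self._current is not None:
            self.sections[self._current].append(text)


def collect_sections(html):
    parser = SectionCollector()
    parser.feed(html)
    parser.close()
    return parser.sections


if __name__ == "__main__":
    # Invented sample markup, not copied from the real page.
    sample = (
        "<div><b>Marriages</b><br>Jane A. Doe and John B. Roe, Nov. 17<br></div>"
        "<div><b>Births</b><br>Girl, to Pat and Sam Example<br></div>"
    )
    print(collect_sections(sample))
```

Any bold text that isn't a heading would be treated as a new section, which is precisely the kind of assumption that falls apart on a real newspaper page.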
So I got quite a ways down the road before realizing that I was engaged in an essentially fruitless effort, and one that might not be the optimal use of my extremely scant free time. At first glance, the content looks fairly uniform. The word "Marriages" is in a bold tag, followed by a string containing two names and a date. The two names can reliably be split using the "and", and the date can be determined by getting everything after the last comma.
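The happy-path version of that rule fits in a few lines. A minimal sketch, assuming a line shaped like "Name and Name, date"; the function name `parse_marriage` and the sample line are invented for illustration.

```python
def parse_marriage(line):
    """Split 'A and B, date' into (first_name, second_name, date).

    The date is taken to be everything after the last comma, and the
    names are split on the first ' and ' (with spaces, so 'Sandra'
    doesn't trip it).
    """
    names, _, when = line.rpartition(",")
    first, sep, second = names.partition(" and ")
    if not sep:
        raise ValueError(f"no ' and ' found in: {line!r}")
    return first.strip(), second.strip(), when.strip()


if __name__ == "__main__":
    # Invented sample line.
    print(parse_marriage("Jane A. Doe and John B. Roe, Nov. 17"))
```

Note that a date with its own comma in it (say, a trailing year) would come back as just the year, so even this simple rule leans on the format never changing.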

The first name under Marriages is a good example of the frustration incurred when doing this type of thing. I'd start by simply splitting the line into two names using the "and":
Lisa K. Devereaux
Kenneth W. Oldperson Jr
Then split each name into its component parts by splitting out the spaces. But lookee there! Kenneth Oldperson has that "Jr" at the end, leading to all sorts of ambiguity. My parser thinks it's his last name! Okay, so I make an exception for that case, only to come across someone with a III at the end of their name, or two middle names or a first name that's two initials, and you very quickly realize there's no good way to account for all of the possibilities.
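Here's what the exception-piling looks like in practice. This is a sketch of the approach, not my actual code: the suffix list is an assumption and certainly incomplete, and `split_name` is a made-up name.

```python
# Assumed, incomplete list of generational suffixes.
SUFFIXES = {"jr", "sr", "ii", "iii", "iv"}


def split_name(full):
    """Split a full name into first, middle, last, and suffix.

    Still wrong for two-word surnames, first names written as two
    initials, and anything else the suffix list doesn't anticipate.
    """
    parts = full.replace(",", " ").split()
    suffix = None
    # Only treat the last token as a suffix if a real name remains.
    if len(parts) > 2 and parts[-1].rstrip(".").lower() in SUFFIXES:
        suffix = parts.pop()
    if len(parts) < 2:
        raise ValueError(f"can't split name: {full!r}")
    return {
        "first": parts[0],
        "middle": " ".join(parts[1:-1]),
        "last": parts[-1],
        "suffix": suffix,
    }


if __name__ == "__main__":
    print(split_name("Kenneth W. Oldperson Jr"))
    print(split_name("Lisa K. Devereaux"))
```

Every new edge case means another branch, and the middle-name guess (everything between first and last) is wrong for anyone with a multi-word last name.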
And take a look at the Births section. For one thing, from the DOM perspective, the first three entries are in one DIV and the second three are in another. Obviously, this doesn't lend itself to the predictability I need in order to parse this shit reliably. Then you'll notice that the first entry is a single mom, another exception case. Then there are the words "twin girls," yet another irregularity.
But it's the Sex Offenders section that presents the biggest headache. Consider this line:
Thomas Beaulieu, 345 W. Front St., No. 2, Missoula
It's almost impossible to parse shit like this in a repeatable fashion with all those commas and periods in there.
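The best I could imagine is something like the sketch below, which leans on two assumptions I can't back up: the name is always the first comma-separated field and the city is always the last. The function name and the sample line are invented placeholders.

```python
def parse_offender(line):
    """Split 'Name, street address..., City' into three fields.

    Assumes the name is the first field and the city is the last;
    everything in between is glued back together as the address.
    """
    fields = [f.strip() for f in line.split(", ")]
    if len(fields) < 3:
        raise ValueError(f"expected name, address, city in: {line!r}")
    return {
        "name": fields[0],
        "address": ", ".join(fields[1:-1]),
        "city": fields[-1],
    }


if __name__ == "__main__":
    # Invented placeholder line in the same shape as the example above.
    print(parse_offender("John Q. Public, 123 N. Example St., No. 4, Missoula"))
```

The first "Public, Jr." or missing city silently shifts every field over by one, and there's no way to detect it from the line alone.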
So I gave up, deciding that my time was much better used by just looking at the new entries every Monday. I must, however, give nods of admiration to the Hpricot HTML parser library and the thoughtfully-named HTML Parser Java library. But really, newspapers should think about RSSing this type of thing.
Yale.
- charlie December 03, 2007 19:50
You may possibly have too much time on your hands!
parse my arse.
love,
- Bruce December 05, 2007 11:26
b
'tis a farce, of carse.
- arndy smetarnka December 06, 2007 09:37
Wow. I love the idea of data-mining like this. It is a supposed trend to do stuff like this with academic articles to find material not listed in the abstract, meta-data, or keywords. You will have to give me a primer on it whenever I see you next.
- Rick December 12, 2007 12:29
Imagine the demographic info you could pull from such a tool, if you were to get around all the irregularities. For example, I can picture a Missoula map with colors reflecting the number of sex offenders per square mile or something, so "stay off the 2300-2400 blocks of Broadway, young persons!" Or, "Wow, quite a number of marijuana-related misdemeanors in that lower Rattlesnake!" "There appears to be a correlation between Griz home games and drunk driving."
- Josh December 19, 2007 13:09