
In Search Of Raw Data

Thursday, November 25, 2004
Perhaps if Samuel Taylor Coleridge were alive today, he’d write verse like the following:

Data, data, every where,
And all the disks did shrink;
Data, data, every where,
Nor any bit to think.

Then again, if Coleridge were alive today maybe he’d be spending all of his time watching reenactments of Mongol military campaigns on cable TV’s History Channel and never write a single poem. Who knows? At least he had the foresight to end lines with semi-colons.

I’ve wanted to experiment with the new DTS and Analysis Services offerings in SQL Server 2005 for some time. I’ve spent a fair amount of time with the current versions and have come to love and loathe certain features. I decided the first order of business would be to find an interesting set of data to work with from start to finish. Some Google searches for raw data in CSV format turned up several possibilities.

The Federal Justice Statistics Resource Center collects information on crime, but inappropriate use of the data violates three federal regulations. Next.

The Baseball Archive is highly touted as having the most complete set of baseball statistics anywhere. These stats are already sliced and diced to death. Next.

Carnegie Mellon has StatLib – a collection of various datasets and statistical software. There were some interesting possibilities here.

The Centers for Disease Control and Prevention publishes tons of data, but not all in easy formats, and I’m already hip-deep in mortality data during the day. Next.

A Journalist’s Database of Databases has a collection of links to national and international sources of data – this is where I found the following entry.

The Bureau of Transportation Statistics has plenty of information about planes, trains, and automobiles. The web interface allows you to pick which fields to export. One data set that caught my eye was the on-time performance of domestic flights. It’s a simple CSV import and might make for an interesting OLAP cube. One month of data looks like it has about 1 million records.
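
To get a feel for what such an export might hold before building anything in SQL Server, a quick script can roll the rows up along one dimension. The following is only a rough sketch in Python: the column names (carrier, origin, arr_delay_minutes) and the sample rows are made up for illustration, since the real file's layout depends on which fields you pick in the web interface.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows; a real export's columns depend on the fields chosen.
SAMPLE_CSV = """carrier,origin,arr_delay_minutes
AA,BWI,12
AA,ORD,-3
UA,ORD,45
UA,BWI,
"""


def average_delay_by_carrier(csv_file):
    """Return {carrier: mean arrival delay}, skipping rows with no delay value."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in csv.DictReader(csv_file):
        delay = row["arr_delay_minutes"].strip()
        if not delay:
            continue  # no delay recorded for this flight
        totals[row["carrier"]] += float(delay)
        counts[row["carrier"]] += 1
    return {carrier: totals[carrier] / counts[carrier] for carrier in totals}


averages = average_delay_by_carrier(io.StringIO(SAMPLE_CSV))
for carrier, avg in sorted(averages.items()):
    print(f"{carrier}: {avg:.1f} minutes")
```

Swapping the StringIO object for a file opened with open("export.csv", newline="") would run the same roll-up against a downloaded file. This is roughly the kind of aggregate a cube would precompute across carrier, airport, and date.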

Perhaps if Coleridge were alive today, and had a bad experience at the airport, he’d write something like this:

Day after day, day after day,
We stuck, nor breath nor motion;
As idle as a painted jet,
On this tarmac of commotion.