OdeToCode IC Logo

In Search Of Raw Data

Thursday, November 25, 2004 by scott
Perhaps if Samuel Taylor Coleridge were alive today, he’d write verse like the following:

Data, data, every where,
And all the disks did shrink;
Data, data, everwhere,
Nor any bit to think.

Then again, if Coleridge were alive today maybe he’d be spending all of his time watching reenactments of Mongol military campaigns on cable TV’s History Channel and never write a single poem. Who knows? At least he had the foresight to end lines with semi-colons.

I’ve wanted to experiment with the new DTS and Analysis Services offerings in SQL Server 2005 for some time. I’ve spent a fair amount of time with the current versions and have come to love and loath certain features. I decided the first order of business would be to find an interesting set of data to work with from start to finish. Some google searches for raw data in CSV format turned up several possibilities.

The Federal Justice Statistics Resource Center collects information on crime, but inappropriate use of the data violates three federal regulations. Next.

The Baseball Archive is highly touted as having the most complete set of baseball statistics anywhere. These stats are already sliced and diced to death. Next.

Carnegie Mellon has StatLib – a collection of various datasets and statistical software. There were some interesting possibilities here.

The Center For Disease Control and Prevention publishes tons of data, but not all in easy formats, and I’m already hip deep in mortality data during the day. Next.

A Jounalist’s Database of Databases has a collection of links to national and international sources of data – this is where I found the following entry.

The Bureau Of Transportation Statistics has plenty of information about planes, trains, and automobiles. The web interface allows you to pick which fields to export. One data set that caught my eye was the on-time performance of domestic flights. It’s a simple CSV import and might make for an interesting OLAP cube. One month of data looks like it has about 1 million records.

Perhaps if Coleridge were alive today, and had a bad experience at the airport, he’d write something like this:

Day after day, day after day,
We stuck, nor breath nor motion ;
As idle as a painted jet,
On this tarmac of commotion.

Remoting Over TDS

Sunday, November 21, 2004 by scott

This week saw the release of a long awaited white paper: Using CLR Integration in SQL Server 2005. This white paper is the first step in what I am sure will be an ongoing stream of best practice information for using the common language runtime inside the database engine.

One area in need of some additional guidance will be the design and usage of user defined types (UDTs). In SQL 2005, if you need a columnar data type outside of the normal varchar, int, and datetime etc. types, you can write your own. As the paper outlines, UDTs are not a way to achieve object-relational extensibility, so we won’t want to create a Customer UDT. Instead, UDTs will augment the basic scalar data types in SQL Server. The engine can index a column of UDTs and enforce RI constraints against them.

All of the examples I’ve seen to date deal with UDTs from TSQL and CLR code executing inside database. Wally McClure found the following error message when using a SELECT statement from the query tool:

Could not load file or assembly "Assembly Name, Version=xxxx.xxxx.xxxx.xxxx, Cultere=neutral, PublicKeyToken=null' or one of its dependencies. HRESULT: 0x80070002 The system cannot find the file specified

Niels Berglund explains why this happens. If you use a ToString on the column in the SELECT statement you’ll get back strings, but with no methods invoked you fetch actual instances of the UDT type. The query tool’s appdomain has to load the assembly for this type. Unless the assembly is properly deployed the load will fail.

The interesting thought here is that there is a new .NET remoting channel in town: the tabular data stream protocol used natively by SQL Server. Along with this capability comes all of the intricacies of sharing type across a boundary, like versioning and GAC deployments. Whoa.

Indigo Day

Friday, November 19, 2004 by scott

VSLive! just added an entire day of Indigo content to the conference schedule for San Francisco. February 8th is looking like a busy day.

I wonder if it will snow in the Northeast while I am away. It's always fun to call up snowbound  colleagues back in the office and ask them how the weather is. Last February was so cold here I had soda cans exploding in my car. Brrr.

Virtual Machine Freestyle Battle

Wednesday, November 17, 2004 by scott

I’m driving through the streets today when I pull up beside a car with a personalized license plate: “JAVA VM”. I’m obviously beside a developer who is fanatic about the Java virtual machine.

I guess I’ve been reading too many of the accidental freestyle battles on Bill’s blog lately. I tossed in a CD with some bass thumping action, wound down my window, and started screaming out loud:

Yo, VM boy, you got a queued up packet,
You better call Receive and prepare for the racket.
You didn’t hit return before the left curly bracket,
    gonna send you upstate, in a straight white jacket.

Ya’ll used to say we were the bourgeoisie,
Now .NET is spreading like hepatitis B.
I know you is scared, cuz I’m a wack MC,
    and I got more sources than ODBC.

Better throw up your hands, and your pocket protector,
I’m gonna whip out my king sized garbage collector.
It’s coming after you like a bee after nectar,
    you’ll never add another String to an object typed Vector.

I was ready to lay down some more verse, but the light turned green. The guy put the gas pedal to the floor and drove off in a hurry. I began to revel in my freestyle geek-rap victory.

Then the phone rang, and I woke up.

I swear, if I doze off at the computer, I can have the most pathetic dreams…

New User Group In Town

Tuesday, November 16, 2004 by scott

The BaltoMSDN .NET User Group is just getting started. This month’s meeting is Wednesday, November 17th in Hunt Valley, Maryland. Lonnie Wall from RDA Consultants will be giving an overview of Service Oriented Architecture. If you are in the Baltimore / D.C. area, keep an eye on their calendar and drop in often.

64 bits

Sunday, November 14, 2004 by scott

I think 2005 will be a big year for AMD in the 64 bit server market. Dell has only shipped Intel chips for the longest time, but they are hinting about building 64 bit servers and workstations with AMD CPUs in the future. The move to AMD would make sense. From everything I’ve read, the 64 bit AMD chips are giving Intel a pretty thorough trouncing in performance and Intel is not catching up soon enough.

Most of our customers have a data warehouse in the 40GB – 300GB range, and some are growing at 100 MB a day. Not the biggest databases in the world by any means, but most of these hospitals are using conservative hardware. We do most of our performance testing against a 500 Mhz Pentium database server with 1 GB of RAM and a 100GB warehouse. If we can get a 10 second response times on this machine, it’s optimized for production use.

Most of our customers are not willing to fork over the money to keep the OLAP database and the warehouse on separate machines. They want to keep everything on a single machine to lower licensing and administration costs. The problem with the single machine setup is apparent when watching Analysis Services crunch through 100 million records during cube processing. It’s like watching a steel cage death match between OLAP and SQL Server as to who can grab the most RAM.

The single machine might work if only more memory were available. The 32 bit version of Analysis Services does not support Address Windowing Extensions (AWE), so the most memory the process can have is 3 GB with the /3GB switch enabled, a barrier the 64 bit version overcomes. SQL 2000 64 bit only supports Itanium, whose sales cannot seem to get off the ground. SQL 2005 will support AMD64. I’m thinking this will fly off the shelf for anyone doing OLAP work with a warehouse (though we’ve yet to see pricing for 2005).

I did a quick search for dual 64 bit Opteron servers on eBay today and found this. The server uses 40 400GB SATA drives to offer 16TB of storage, and has 8 GB of RAM. Yours for only $30,000 US. Just think, in five years, you could probably replace it with a notebook or cell phone.

Speaking & Presentations

Sunday, November 14, 2004 by scott

Upcoming Presentations in 2009!