XM Radio Player Part II : Scraping

Friday, December 5, 2008

Just to make sure everything was as easy as it looked in Fiddler, I wrote a quick and dirty piece of throwaway code to see if I could programmatically log in to XM and play a stream of music with Windows Media Player. It was ugly, but …

public void Can_Start_Media_Payer_With_Xm_Stream()
{
    var cookies = new CookieContainer();

    // step 1: get auth cookie
    HttpWebRequest request =
        WebRequest.Create("http://xmro.xmradio.com" +
                          "/xstream/login_servlet.jsp") as HttpWebRequest;
    request.CookieContainer = cookies;
    request.ContentType = "application/x-www-form-urlencoded";
    request.Method = "POST";

    var requestStream = request.GetRequestStream();
    var loginData = Encoding.Default.GetBytes(
        "user_id=*******&pword=******");
    requestStream.Write(loginData, 0, loginData.Length);
    requestStream.Close();

    HttpWebResponse response = request.GetResponse() as HttpWebResponse;
    var data = ReadResponse(response);

    // ...

    // step 4: get player URL for channel 26
    request = WebRequest.Create("http://player.xmradio.com" +
                 "/player/2ft/playMedia.jsp?ch=26&speed=high") as HttpWebRequest;
    request.Method = "GET";
    request.CookieContainer = cookies;
    response = request.GetResponse() as HttpWebResponse;
    data = ReadResponse(response);

    Regex regex = new Regex(
        "<param\\s*name=\"FileName\"\\s*value=\"(.*)\"",
        RegexOptions.IgnoreCase);
    string url = regex.Match(data).Groups[1].Value;

    Process.Start("wmplayer.exe", url);
}

… it worked!
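The ReadResponse call in that snippet is a small helper I haven't shown. Assuming all it needs to do is slurp the response body into a string (which is how the snippet uses it), a minimal version looks something like this:

// minimal sketch of the helper: read the whole response body as text
private static string ReadResponse(HttpWebResponse response)
{
    using (var reader = new StreamReader(response.GetResponseStream()))
    {
        return reader.ReadToEnd();
    }
}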

The most difficult piece was digging out the FileName parameter from the playMedia servlet response. The HTML response was almost, but not quite, XHTML compliant. LINQ to XML wasn't an option, unfortunately, so I used a regular expression. I don't write regular expressions often enough to build them without a lot of help, which is why I keep a copy of Roy Osherove's Regulator around. I could paste a copy of the servlet response into Regulator and then iteratively hack at the expression until something worked and I could stop cursing. In a couple of weeks I'll have no idea what stuff like \"(.*)\" means anymore, which is both good and bad at the same time.
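For the curious, here's a tiny standalone check of that expression against a made-up fragment (the URL is just a placeholder, not what the servlet actually returns); group 1 is the piece handed off to Windows Media Player:

// hypothetical fragment shaped like the param tag in the servlet response
string html = "<param name=\"FileName\" value=\"http://example.com/stream\">";

Regex regex = new Regex(
    "<param\\s*name=\"FileName\"\\s*value=\"(.*)\"",
    RegexOptions.IgnoreCase);

// Groups[1] holds whatever the (.*) captured: the stream URL
Console.WriteLine(regex.Match(html).Groups[1].Value);
// http://example.com/stream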

I've always liked the idea of looking at the high risk areas of a project and writing some throwaway code to see if a solution is technically feasible. Some people call this a "spike", and as I've said before, you never want to let spike code into production (but you never really want to throw the code away, either).

Screen Scraping

The regex in the above code is "scraping" data from the HTML. Screen scraping is nearly as old as computers themselves; in fact, it's one of the earliest forms of interoperability. One way to get data out of a mainframe is to capture the data from the mainframe's display output on a terminal, thus the term "screen scraping". Not that I've ever written a terminal scraper, but I've seen them in action, and I wouldn't be surprised if 90% of all Fortune 500 companies that have been in business for more than 30 years are still using screen scraping somewhere for enterprise application integration.

Something I have done quite a bit is screen scraping for the web. "Web scraping" (as some call it) has a slight negative connotation these days because it can be used for nefarious purposes, like the spammers who harvest email addresses from web pages. There have also been some legal battles when people screen scrape to repurpose someone else's content. But generally speaking, screen scraping isn't inherently bad. In this case I'm only using scraped data to build a custom player and listen to music with my paid subscription.

With the .NET libraries there are really just three key areas to focus on:

  • Properly formulating the request
  • Managing cookies
  • Parsing the response

Properly formulating the request was relatively easy in this scenario, even with the credentials in the POST data. Comparing what your program sends to the server against what the web browser sends, using a tool like Fiddler, is the best way to troubleshoot problems. Some scenarios aren't as easy. For instance, if you need to programmatically post to an ASP.NET web form, you need the proper hidden field values for stuff like viewstate and event validation, which means you need to parse those hidden fields out of a previous GET request for the same web form.
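That's not code I needed for XM, but a rough sketch of the dance looks like this (the URL and the SomeTextBox field are placeholders; __VIEWSTATE and __EVENTVALIDATION are the standard ASP.NET hidden fields, and the regexes are deliberately crude):

// step 1: GET the form so we can lift the hidden fields ASP.NET expects back
var cookies = new CookieContainer();
var getRequest = WebRequest.Create("http://example.com/SomeForm.aspx") as HttpWebRequest;
getRequest.CookieContainer = cookies;
string page;
using (var reader = new StreamReader(getRequest.GetResponse().GetResponseStream()))
{
    page = reader.ReadToEnd();
}

// step 2: scrape __VIEWSTATE and __EVENTVALIDATION out of the markup
string viewState = Uri.EscapeDataString(Regex.Match(page,
    "id=\"__VIEWSTATE\"\\s*value=\"(.*?)\"").Groups[1].Value);
string eventValidation = Uri.EscapeDataString(Regex.Match(page,
    "id=\"__EVENTVALIDATION\"\\s*value=\"(.*?)\"").Groups[1].Value);

// step 3: POST the form back with the hidden fields plus our own values
var postRequest = WebRequest.Create("http://example.com/SomeForm.aspx") as HttpWebRequest;
postRequest.CookieContainer = cookies;
postRequest.Method = "POST";
postRequest.ContentType = "application/x-www-form-urlencoded";
byte[] body = Encoding.Default.GetBytes(
    "__VIEWSTATE=" + viewState +
    "&__EVENTVALIDATION=" + eventValidation +
    "&SomeTextBox=hello");
using (var stream = postRequest.GetRequestStream())
{
    stream.Write(body, 0, body.Length);
}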

If a web site requires cookies to work properly (as XM does), then you'll find that managing cookies with the .NET libraries is tricky, but only because it doesn't happen by default. All you need to do is create a single instance of the CookieContainer class and attach that same instance to all of your outgoing web requests. .NET will then take care of storing the cookies it finds in a server's responses and transmitting those cookies back to the server on subsequent requests. The framework knows all about cookie domains and secure cookies, so you don't need to manually finagle HTTP headers at all; just use the container.
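Boiled down, the pattern is just this (the URLs are placeholders):

// one container shared across every request in the "session"
var cookies = new CookieContainer();

var login = WebRequest.Create("http://example.com/login") as HttpWebRequest;
login.CookieContainer = cookies;     // cookies in the response land in the container
login.GetResponse().Close();

var page = WebRequest.Create("http://example.com/members/playlist") as HttpWebRequest;
page.CookieContainer = cookies;      // same container, so the auth cookie goes back out
page.GetResponse().Close();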

Parsing the response is generally where you invest most of your time and feel the most pain. There are managed libraries like the Html Agility Pack that can help, and a thousand other solutions ranging from String.IndexOf, to regular expressions, to XPath queries over XHTML compliant markup. The basic problem is that sites generally don't design their web pages for machine-to-machine interaction, and they can change their markup at any time and break your scraping logic. No matter how robust your scraping logic is, nothing short of a perfect AI can keep it from breaking. It's always better to stick with official interoperability endpoints that use SOAP, REST, or some other form of structured response intended for machines. It's strange that we are almost in the year 2009 and we will still be using strings and regular expressions to interop with most of the web for the foreseeable future.
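For comparison, here's roughly what digging the FileName parameter out would look like with the Html Agility Pack instead of a regex. I haven't swapped this into my throwaway code; it's a sketch, but the XPath approach tolerates the not-quite-XHTML markup (and still breaks the day the markup changes):

// assumes a reference to HtmlAgilityPack.dll; "data" is the playMedia response from earlier
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(data);

// find <param name="FileName" value="..."> anywhere in the document
var param = doc.DocumentNode.SelectSingleNode("//param[@name='FileName']");
if (param != null)
{
    string url = param.GetAttributeValue("value", "");
    Process.Start("wmplayer.exe", url);
}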

Next Steps

Knowing that a custom player was possible, the next step was to pick a platform to build on.

Stay tuned…