XM Radio Player Part II : Scraping

Friday, December 5, 2008

Just to make sure everything was as easy as it looked in Fiddler – I wrote a quick and dirty piece of throwaway code to see if I could programmatically log in to XM and play a stream of music with Windows Media Player. It was ugly, but …

public void Can_Start_Media_Payer_With_Xm_Stream()
{
    var cookies = new CookieContainer();

    // step 1: get auth cookie
    HttpWebRequest request =
        WebRequest.Create("http://xmro.xmradio.com" + "/xstream/login_servlet.jsp") as HttpWebRequest;
    request.CookieContainer = cookies;
    request.ContentType = "application/x-www-form-urlencoded";
    request.Method = "POST";
    var requestStream = request.GetRequestStream();
    var loginData = Encoding.Default.GetBytes("user_id=*******&pword=******");
    requestStream.Write(loginData, 0, loginData.Length);
    requestStream.Close();
    HttpWebResponse response = request.GetResponse() as HttpWebResponse;
    var data = ReadResponse(response);

    // ...

    // step 4: get player URL for channel 26
    request = WebRequest.Create(
        "http://player.xmradio.com" + "/player/2ft/playMedia.jsp?ch=26&speed=high") as HttpWebRequest;
    request.Method = "GET";
    request.CookieContainer = cookies;
    response = request.GetResponse() as HttpWebResponse;
    data = ReadResponse(response);

    Regex regex = new Regex(
        "<param\\s*name=\"FileName\"\\s*value=\"(.*)\"", RegexOptions.IgnoreCase);
    string url = regex.Match(data).Groups[1].Value;
    Process.Start("wmplayer.exe", url.ToString());
}

… it worked!
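
The ReadResponse helper called in the code above is not shown in the post. A minimal sketch of what such a helper might look like, assuming it simply drains the response body into a string. The class name ResponseReader, the ReadAll method, and the UTF-8 fallback are all assumptions made for illustration, not the original implementation.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

public static class ResponseReader
{
    // Hypothetical stand-in for the ReadResponse helper the post calls but does not show.
    public static string ReadResponse(HttpWebResponse response)
    {
        // Fall back to UTF-8 when the charset is missing or not recognized.
        Encoding encoding = Encoding.UTF8;
        if (!string.IsNullOrEmpty(response.CharacterSet))
        {
            try { encoding = Encoding.GetEncoding(response.CharacterSet); }
            catch (ArgumentException) { /* unknown charset: keep the UTF-8 fallback */ }
        }

        using (Stream body = response.GetResponseStream())
        {
            return ReadAll(body, encoding);
        }
    }

    // Split out so the reading logic can be exercised without a live request.
    public static string ReadAll(Stream body, Encoding encoding)
    {
        using (var reader = new StreamReader(body, encoding))
        {
            return reader.ReadToEnd();
        }
    }
}
```

Keeping the stream handling in ReadAll means the same code can be fed a MemoryStream holding a response saved from Fiddler, which is handy when working on the parsing step offline.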

The most difficult piece was digging out the FileName parameter from the playMedia servlet response. The HTML response was almost, but not quite, XHTML compliant. LINQ to XML wasn’t an option, unfortunately, so I used a regular expression. I don’t write regular expressions often enough to build them without a lot of help, which is why I keep a copy of Roy Osherove’s Regulator around. I could paste a copy of the servlet response into Regulator and then iteratively hack at the regular expression until something worked and I could stop cursing. In a couple of weeks I’ll have no idea what stuff like \"(.*)\" means anymore, which is both good and bad at the same time.
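
For future reference, here is the same pattern broken into commented pieces. This is a sketch, not the code from the spike: the class and method names are made up, and it swaps the greedy (.*) for [^"]*. A greedy (.*) followed by a quote runs to the last quote on the line, so the narrower form is safer if the servlet ever puts several attributes or tags on one line. The post does not show the real servlet response, so any markup fed to it here is an invented stand-in.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FileNameScraper
{
    // <param       literal start of the tag
    // \s*name=    optional whitespace, then the name attribute
    // "FileName"  the attribute value we are looking for
    // \s*value="  optional whitespace, then the opening quote of value
    // ([^"]*)     group 1: everything up to the next double quote
    private static readonly Regex FileNameParam = new Regex(
        "<param\\s*name=\"FileName\"\\s*value=\"([^\"]*)\"",
        RegexOptions.IgnoreCase);

    // Returns the captured value, or null when the markup does not match.
    public static string Extract(string html)
    {
        Match match = FileNameParam.Match(html);
        return match.Success ? match.Groups[1].Value : null;
    }
}
```

Checking match.Success before reading the group also makes a markup change show up as a null rather than as an empty string handed to the media player.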

I’ve always liked the idea of looking at the high risk areas of a project and writing some throwaway code to see if a solution is technically feasible. Some people call this a “spike” and as I’ve said before – you never want to let spike code into production (but you never want to really throw the code away, either).

Screen Scraping

The regex in the above code is “scraping” data from the HTML. Screen scraping is nearly as old as computers themselves – in fact it’s one of the earliest forms of interoperability. One way to get data from a mainframe is to capture the data from the mainframe’s display output on a terminal – thus the term “screen scraping”. Not that I’ve ever written a terminal scraper, but I’ve seen them in action and I wouldn’t be surprised if 90% of all Fortune 500 companies that have been in business for more than 30 years are still using screen scraping somewhere for enterprise application integration.

Something I have done quite a bit is screen scraping for the web. “Web scraping” (as some call it) has a slight negative connotation these days as it can be used for nefarious purposes – like the spammers who harvest email addresses from web pages. There have also been some legal battles when people screen scrape to repurpose someone else’s content. But generally speaking - screen scraping isn’t inherently bad. In this case I’m only using scraped data to build a custom player and listen to music with my paid subscription.

With the .NET libraries there are really just three key areas to focus on:

  • Properly formulating the request
  • Managing cookies
  • Parsing the response

Properly formulating the request was relatively easy in this scenario, even with the credentials in the POST data. Comparing what your program is sending to the server with what the web browser sends to the server using a tool like HTTP Fiddler is the best way to troubleshoot problems. Some scenarios aren’t as easy. For instance, if you need to programmatically post to an ASP.NET web form you need the proper hidden field values for stuff like viewstate and event validation, which means you need to parse these hidden fields from a previous GET request for the same web form.
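
As a rough sketch of the ASP.NET case: pull the hidden inputs out of the HTML returned by the earlier GET, then echo them back in the POST body alongside your own form values. The class and method names are hypothetical, and the regex assumes the type, name, and value attributes appear in that order inside the tag, which is how ASP.NET typically renders them but is not guaranteed.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

public static class WebFormFields
{
    // Assumes type, then name, then value inside a single <input> tag.
    private static readonly Regex HiddenInput = new Regex(
        "<input[^>]*type=\"hidden\"[^>]*name=\"([^\"]+)\"[^>]*value=\"([^\"]*)\"",
        RegexOptions.IgnoreCase);

    // Collects every hidden field (e.g. __VIEWSTATE, __EVENTVALIDATION) by name.
    public static Dictionary<string, string> ReadHiddenFields(string html)
    {
        var fields = new Dictionary<string, string>();
        foreach (Match m in HiddenInput.Matches(html))
        {
            fields[m.Groups[1].Value] = m.Groups[2].Value;
        }
        return fields;
    }

    // Builds an application/x-www-form-urlencoded body from the fields.
    public static string BuildPostBody(IDictionary<string, string> fields)
    {
        var pairs = new List<string>();
        foreach (KeyValuePair<string, string> field in fields)
        {
            pairs.Add(WebUtility.UrlEncode(field.Key) + "=" + WebUtility.UrlEncode(field.Value));
        }
        return string.Join("&", pairs);
    }
}
```

URL-encoding matters here because the base64 viewstate value routinely contains +, / and = characters, and an unencoded + would reach the server as a space.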

If a web site requires cookies to work properly (as XM does), then you’ll find that managing cookies with the .NET libraries is tricky – but only because it doesn’t happen by default. However, all you need to do is create an instance of the CookieContainer class and attach this same instance to all of your outgoing web requests. .NET will then take care of storing cookies it finds in a server’s response, and transmitting those cookies back to the server on subsequent requests. The framework knows all about cookie domains and secure cookies, so you don’t need to manually finagle HTTP headers at all – just use the container.
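
The container's domain and secure-cookie handling can be seen without touching the network. In this sketch the host names, cookie names, and values are all invented, and SetCookies stands in for what the framework does when a real response arrives on a request that has the container attached.

```csharp
using System;
using System.Net;

public static class CookieJarDemo
{
    // Simulates a login response that sets one ordinary and one secure cookie.
    public static CookieContainer JarAfterLogin()
    {
        var jar = new CookieContainer();
        // SetCookies parses Set-Cookie style text; cookies are comma-delimited.
        jar.SetCookies(new Uri("https://login.example.com/"),
            "session=abc123; path=/, token=s3cret; path=/; secure");
        return jar;
    }

    // Returns the Cookie header value the container would send to the URL.
    public static string HeaderFor(CookieContainer jar, string url)
    {
        return jar.GetCookieHeader(new Uri(url));
    }
}
```

Asking for the header for an https URL on login.example.com yields both cookies, the http URL on the same host gets only the non-secure one, and an unrelated host gets an empty string. In the real flow you never call these methods yourself; assigning the same container to each request's CookieContainer property, as the spike does, is enough.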

Parsing the response is generally where you invest most of your time and feel the most pain. There are managed libraries like the Html Agility Pack that can help, and 1000 other solutions from String.IndexOf, to regular expressions, to XPath queries over XHTML compliant markup. The basic problem is that sites generally don’t design their web pages for machine-to-machine interaction and they can change their markup at any time and break your scraping logic. No matter how robust your scraping logic is – nothing short of a perfect AI can keep your logic from breaking. It’s always better to stick with official interoperability endpoints that use SOAP, REST, or some other form of structured response that is intended for machines. It’s strange that we are almost in the year 2009 and we will still be using strings and regular expressions to interop with most of the web for the foreseeable future.
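
For comparison, here is roughly what the XPath route looks like when the markup really is well-formed. This is a sketch under several assumptions: the class and method names are hypothetical, the document declares no default XML namespace (a real XHTML page usually does, and the query would then need a namespace manager), and the parameter name contains no quote characters.

```csharp
using System;
using System.Xml.Linq;
using System.Xml.XPath;

public static class XhtmlScraper
{
    // Works only on well-formed XML; XDocument.Parse throws XmlException
    // on markup that is almost, but not quite, XHTML.
    public static string FindParamValue(string xhtml, string paramName)
    {
        XDocument doc = XDocument.Parse(xhtml);

        // XPath attribute comparisons are case-sensitive.
        XElement param = doc.XPathSelectElement(
            "//param[@name='" + paramName + "']");

        // The cast returns null when the value attribute is absent.
        return param == null ? null : (string)param.Attribute("value");
    }
}
```

The query reads better than the regex and survives attribute reordering, but one unclosed tag anywhere on the page makes the parse fail, which is exactly the trade-off that pushes scrapers back toward regular expressions.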

Next Steps

Knowing that a custom player was possible, the next step was to pick a platform to build on.

Stay tuned…


Comments
Ray Friday, December 5, 2008
This is good stuff. I wrote a Sirius player a long time ago but I bailed on the scraping and settled on a single WinForm with a docked WebBrowser control that automatically navigated to the Sirius internet logon page, and could be minimized to the system tray. Not the best thing ever, but you don't get all the song/artist information in Firefox, so having an IE window open all the time was damn annoying. Maybe I'll see if something's doable with this XM scraping.
Ian Patrick Hughes Friday, December 5, 2008
Ha! I was going to write a comment asking whether that code was checked in anywhere, but you mentioned it. Ever since the Herding Code cast where everyone voted (if I recall correctly) on whether they checked in throwaway code, it's a topic of humor in our office.

When a new feature is added everyone asks: "Did this start as throwaway?" or "Did you check this in before you implemented into your main code line?"

Scraping is extremely useful, of course. While it's true there is really no silver bullet for HTML scraping, because you're dependent upon a 3rd party source and have no control over structure changes, I have used that to my advantage. By that I mean, the most frequent scraping requests I have received from higher-ups have been nefarious in nature, generally used to avoid paying for data from a wholesaler. I have exaggerated how often the scrape target's structure would change to try and dissuade the project from ever starting.

blackhorus Monday, December 8, 2008
I did something similar with MediaMaster, but for lack of time I put it aside. The hard part was finding a decent free mp3 lib for .NET that handles streaming.
scott Monday, December 8, 2008
@blackhorus - I'm going to use a silverlight or WPF MediaElement to handle the mms.
by K. Scott Allen