Page Scraping

Thursday, June 29, 2006

Q: I want to programmatically retrieve a web page and parse out some information inside. What's the best approach?

A: For fetching the contents of a page, the simplest class to use is the System.Net.WebClient class. For more advanced scenarios that require cookies and simulated postbacks to the server, chances are you'll have to graduate to the System.Net.WebRequest and WebResponse classes. You'll find a lot of material on the web that demonstrates how to use these classes.
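As a minimal sketch of the simple case, the following uses WebClient.DownloadString to pull a page into a string. The class name FetchSketch and the Fetch helper are made up for illustration, and the default URL is only an example. Note that recent .NET versions flag WebClient as obsolete at compile time, so treat this as a sketch of the approach rather than the one right way to do it.

```csharp
using System;
using System.Net;

public static class FetchSketch
{
    // Hypothetical helper: downloads the resource at 'url' and returns it as a string.
    public static string Fetch(string url)
    {
        using (var client = new WebClient())
        {
            return client.DownloadString(url);
        }
    }

    public static void Main(string[] args)
    {
        // The default URL is just an example; pass another one on the command line.
        string url = args.Length > 0 ? args[0] : "https://odetocode.com";
        string html = Fetch(url);
        Console.WriteLine("Fetched {0} characters", html.Length);
    }
}
```

Once cookies or form posts enter the picture, this helper is the piece you would swap out for WebRequest and WebResponse.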

If you have to pull specific information out of a page, then the "best approach" will depend on the complexity of the page and nuances of the data. Once you have the contents of a page in a string variable, a few IndexOf() and Substring() method calls might be enough to parse out the data you need.
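For the simple cases, something along these lines is usually all it takes. This is only a sketch: the Between helper and the sample markup are invented for illustration, and it assumes the markers you search for appear exactly once and in a predictable form.

```csharp
using System;

public static class SubstringSketch
{
    // Hypothetical helper: returns the text between the first 'startMarker'
    // and the next 'endMarker', or null if either marker is missing.
    public static string Between(string html, string startMarker, string endMarker)
    {
        int start = html.IndexOf(startMarker, StringComparison.OrdinalIgnoreCase);
        if (start < 0) return null;
        start += startMarker.Length;

        int end = html.IndexOf(endMarker, start, StringComparison.OrdinalIgnoreCase);
        if (end < 0) return null;

        return html.Substring(start, end - start);
    }

    public static void Main()
    {
        // Made-up sample page; in practice this string comes from the download step.
        string html = "<html><head><title>OdeToCode</title></head><body></body></html>";
        Console.WriteLine(Between(html, "<title>", "</title>"));
    }
}
```

With the sample string above this prints OdeToCode. It stops being enough as soon as the opening tag can carry attributes or the element can appear more than once.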

Many people use the Regex class to find data inside of HTML. I'm not a fan of this approach, though. There are so many edge cases to contend with in HTML that the regular expressions grow hideously complex, and the regular expression language is notorious for being a "write-only" language.
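To make the edge-case problem concrete, here is a small sketch of the regular expression approach. The pattern, the RegexSketch class, and the sample markup are all my own illustration, not a recommended pattern. It pulls href values out of anchor tags, but only when the value is wrapped in double quotes.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RegexSketch
{
    // Matches href="..." inside an <a ...> tag. Only double-quoted values are handled.
    private static readonly Regex HrefPattern = new Regex(
        @"<a\s[^>]*href\s*=\s*""([^""]*)""", RegexOptions.IgnoreCase);

    // Hypothetical helper: returns every href value the pattern can find.
    public static List<string> Hrefs(string html)
    {
        var results = new List<string>();
        foreach (Match match in HrefPattern.Matches(html))
        {
            results.Add(match.Groups[1].Value);
        }
        return results;
    }

    public static void Main()
    {
        // Made-up sample: the third link uses single quotes.
        string html = "<a href=\"/a\">A</a> <a class=\"x\" href=\"/b\">B</a> <a href='/c'>C</a>";
        foreach (string href in Hrefs(html))
        {
            Console.WriteLine(href);
        }
    }
}
```

Run against the sample, this prints /a and /b and silently skips /c because of the single quotes. Each fix for a case like that (single quotes, unquoted values, tags split across lines) makes the pattern harder to read.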

My usual approach is to transform the web page into an object model. This sounds complicated, but not if someone else does all the heavy lifting. Two pieces of software that can help are the SgmlReader on GotDotNet, and Simon Mourier's Html Agility Pack. The agility pack is still a .NET 1.1 project, but I have it running under 2.0 with only minor changes (I just needed to remove some conditional debugging attributes). With these libraries, it is easy to walk through the page like an XML document, perform XSL transformations, or find data using XPath expressions (which to me are a lot more readable than regular expressions).
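To give a feel for the XPath side using nothing but the framework, here is a sketch that assumes the page has already been turned into well-formed XML (the kind of output a tool like SgmlReader is meant to produce). The XPathSketch class, the LinkTargets helper, and the sample markup are invented for illustration, and the markup deliberately has no XML namespace, which would otherwise need a namespace manager.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class XPathSketch
{
    // Hypothetical helper: returns the href of every <a> element that has one.
    public static List<string> LinkTargets(string xhtml)
    {
        var document = new XmlDocument();
        document.LoadXml(xhtml); // throws XmlException if the markup is not well formed

        var targets = new List<string>();
        foreach (XmlNode anchor in document.SelectNodes("//a[@href]"))
        {
            targets.Add(anchor.Attributes["href"].Value);
        }
        return targets;
    }

    public static void Main()
    {
        // Made-up, well-formed sample page; the middle anchor has no href.
        string page =
            "<html><body>" +
            "<p><a href=\"/articles\">Articles</a></p>" +
            "<a name=\"top\">Top</a>" +
            "<a href=\"/blogs\">Blogs</a>" +
            "</body></html>";

        foreach (string target in LinkTargets(page))
        {
            Console.WriteLine(target);
        }
    }
}
```

With the sample page this prints /articles and /blogs. The expression //a[@href] says "every anchor that has an href attribute", and it is the whole of the parsing logic.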

Here is a little snippet of code that uses the agility pack and will dump the text inside all of the links (<a> anchor tags) on OTC's front page.

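// Requires: using System; using System.IO; using System.Net; using HtmlAgilityPack;
WebRequest request = WebRequest.Create("https://odetocode.com");
using (WebResponse response = request.GetResponse())
using (Stream stream = response.GetResponseStream())
{
    HtmlDocument document = new HtmlDocument();
    document.Load(stream); // parse the response body into an object model

    // Write out the text inside every <a> element on the page.
    foreach (HtmlNode node in document.DocumentNode.SelectNodes("//a"))
        Console.WriteLine(node.InnerText);
}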

The "Scraping" term in the title of this post comes from "screen scraping", which is a term almost as old as IT itself.

jonnosan Thursday, June 29, 2006
Yes, I know I'm a ruby fanboy. But I'd seriously suggest anyone thinking about screen scraping look at ruby, including the Watir toolkit (http://wtr.rubyforge.org/) and RubyFul Soup (http://www.crummy.com/software/RubyfulSoup/).

Here's the equivalent ruby code for your example:

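require 'watir'
require 'rubyful_soup'
# Drive IE to the page, then hand the rendered HTML to Rubyful Soup.
ie = Watir::IE.start("https://odetocode.com")
soup = BeautifulSoup.new(ie.html)
soup.find_all("a").each{|a| puts a.contents}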

This doesn't really show the real benefits of watir/rubyful_soup though, which are
1) since watir drives IE, it automatically handles javascript and redirects
2) watir makes it easy to scrape from sites that use POSTs to forms.
3) rubyful_soup makes it easy to find tags by their position relative to other tags (as in find the heading that comes before the 2nd table)

Plus, if you use ruby, you can turn into another gibbering fanboy like me
scott Thursday, June 29, 2006
That does look cool, and it's something I need to check out.
Gary Thursday, June 29, 2006
I spent about a year writing a complete screen scrape and replay tool for a company I worked for. Walking the DOM will allow you to get every piece of information that is available, but it is a very tedious process.

The need to find all possible input fields and hook events, so that you don't miss those fields (dropdowns) that collect data, fire an event, and clear themselves, is a rather large undertaking.

I am trying to install watir right now to see what it can do.

The tool that I wrote allows for the saving of scripts, playback, recording of performance metrics, reporting of this data and things like adding customer wait time to reproduce how a user would actually use a web site.

Depending on your needs, Ruby may end up being a much simpler way to go. There are tons of little things that pop up if you go the screen scrape route.
Jeff Atwood Thursday, June 29, 2006

But I see jonnosan already beat me to it..

Also, I wouldn't be so quick to dismiss regex for common parsing tasks.

Unless you have pathological HTML (and usually, if you do, you just bail on that page anyway) it can work quite well.