OdeToCode IC Logo

Page Scraping

Thursday, June 29, 2006

Q: I want to programmatically retrieve a web page and parse out some information inside. What's the best approach?

A: For fetching the contents of a page, the simplest class to use is the System.Net.WebClient class. For more advanced scenarios that require cookies and simulated postbacks to the server, chances are you'll have to graduate to the System.Net.WebRequest and WebResponse classes. You'll find a lot of material on the web that demonstrate how to use these classes.

If you have to pull specific information out of a page, then the "best approach" will depend on the complexity of the page and nuances of the data. Once you have the contents of a page in a string variable, a few IndexOf() and Substring() method calls might be enough to parse out the data you need.

Many people use the RegEx class to find data inside of HTML. I'm not a fan of this approach, though. There are so many edge cases to contend with in HTML that the regular expressions grow hideously complex, and the regular expression language is notorious for being a "write-only" language.

My usual approach is to transform the web page into an object model. This sounds complicated, but not if someone else does all the heavy lifting. Two pieces of software that can help are the SgmlReader on GotDotNet, and Simon Mourier's Html Agility Pack. The agility pack is still a .NET 1.1 project, but I have it running under 2.0 with only minor changes (I just needed to remove some conditional debugging attributes). With these libraries, it is easy to walk through the page like an XML document, perform XSL transformations, or find data using XPath expressions (which to me are a lot more readable than regular expressions).

Here is a little snippet of code that uses the agility pack and will dump the text inside all of the links (<a> anchor tags) on OTC's front page.

WebRequest request = WebRequest.Create("https://odetocode.com");
using (WebResponse response = request.GetResponse())
    HtmlDocument document =
new HtmlDocument();

foreach (HtmlNode node in document.DocumentNode.SelectNodes("//a"))

The "Scraping" term in the title of this post comes from "screen scraping", which is a term almost as old as IT itself.