Screen Scraping, ViewState, and Authentication using ASP.Net

Posted on Saturday, July 03, 2004

This article will examine three options to fetch HTML output from a URL, including how to fetch the output of an ASPX page using Viewstate and forms-based authentication.

Before web services came along, screen scraping was a popular technique for grabbing the output from another application by examining the text it displays on the screen. For web applications, this meant making a request to a URL and examining the HTML the server returns. You could then parse the HTML to grab the latest news headlines or stock quotes from a news site, or the price of a book on amazon.com.

With RSS, XML, and Web Services, the need to screen scrape has diminished, but is not extinct. In this article we will examine a few methods to grab the HTML from another URL for display in your own page.

HttpServerUtility

If the page you need to fetch is part of the current web application, you can use the Execute method on the Server object of the current page. The Server object is of type HttpServerUtility, which also includes the well-known methods Transfer and MapPath. Using Execute is straightforward:

TextWriter textWriter = new StringWriter();
Server.Execute("myOtherPage.aspx", textWriter);
Response.Output.Write(textWriter.ToString());

You can use Server.Execute to add content to frames, or devise print-friendly pages. We generally would not want to write the entire contents of the resulting string into the response as we have in this sample, but instead would parse select content from myOtherPage.aspx. Of course, we are not always lucky enough to have the resource we need inside of the same web application, and this is where classes from the System.Net namespace come into play.
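As a rough illustration of parsing select content out of the fetched string, the sketch below pulls the page title with a regular expression. The class name HtmlFragments and the method name ExtractTitle are made up for this example, and a regular expression is only one possible approach; it assumes reasonably well-formed markup.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HtmlFragments
{
    // Hypothetical helper: returns the text between <title> and </title>,
    // or null when the markup contains no title element.
    public static string ExtractTitle(string html)
    {
        Match match = Regex.Match(
            html,
            @"<title[^>]*>(.*?)</title>",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);

        return match.Success ? match.Groups[1].Value.Trim() : null;
    }
}
```

With the Server.Execute sample above, a call could look like HtmlFragments.ExtractTitle(textWriter.ToString()).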

WebClient

The WebClient class presents the simplest API possible for retrieving content from a URL, as seen below.

using(WebClient webClient = new WebClient())
{
   byte[] response = webClient.DownloadData(THEURL);
   Response.OutputStream.Write(response, 0, response.Length);   
}

We need only three lines of code, but this time instead of passing the name of an ASPX page inside of our application, we can pass the URL to a remote resource, like http://www.OdeToCode.com/default.aspx.

The next hurdle you might face is retrieving content from a web site requiring forms authentication. Forms authentication usually requires a user to enter credentials into a form and press a submit button. Pressing submit will cause the browser to perform an HTTP “POST” and send the form values, such as the username and password, in the message body to the server (for more information on GET and POST see the resource section at the bottom of the article).

As an example, consider the source code for the following login form:

<form name="Form1" method="post" action="login.aspx" id="Form1">
<P>Username
    <input name="UsernameTextBox" type="text" id="UsernameTextBox" /></P>
<P>Password
    <input name="PasswordTextBox" type="text" id="PasswordTextBox" /></P>
<P>
    <input type="submit" name="LoginButton" value="Login" id="LoginButton" /></P>
</form>

In the message body of the browser POST, the form values could appear like so:

UsernameTextBox=scott&PasswordTextBox=scott&LoginButton=Login

When this payload arrives at the server, the code will know the user entered ‘scott’ into the username textbox, ‘scott’ in the password text box, and posted the form using the Login button. We can use the WebClient class to simulate a POST for this form with the following code.

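// the form values, exactly as the browser would send them
string postData = "UsernameTextBox=scott&PasswordTextBox=scott&LoginButton=Login";

WebClient webClient = new WebClient();
// tell the server the message body contains URL-encoded form values
webClient.Headers.Add("Content-Type", "application/x-www-form-urlencoded");
byte[] response = webClient.UploadData(
      LOGIN_URL, "POST", Encoding.ASCII.GetBytes(postData)
   );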
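The postData string passed to UploadData can also be assembled from name/value pairs, which guards against values containing characters such as & or =. The following is a minimal sketch, not part of the original sample: FormEncoder and BuildPostData are hypothetical names, and it assumes System.Net.WebUtility is available (it is not in the earliest versions of the framework).

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text;

public static class FormEncoder
{
    // Hypothetical helper: joins form fields as name=value pairs separated
    // by '&', URL-encoding every name and value along the way.
    public static string BuildPostData(
        IEnumerable<KeyValuePair<string, string>> fields)
    {
        StringBuilder body = new StringBuilder();
        foreach (KeyValuePair<string, string> field in fields)
        {
            if (body.Length > 0)
            {
                body.Append('&');
            }
            body.Append(WebUtility.UrlEncode(field.Key));
            body.Append('=');
            body.Append(WebUtility.UrlEncode(field.Value));
        }
        return body.ToString();
    }
}
```

Passing the three fields of the login form (UsernameTextBox, PasswordTextBox, and LoginButton) in order should produce the same payload shown earlier.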

However, trying to POST to an ASP.NET page will usually involve one more obstacle: the Viewstate. We will not be covering Viewstate in this article (see resources below), except we need to know how to correctly POST the Viewstate back to the server. ASP.NET sends Viewstate to the client in a hidden form field, and we must parse out the correct value in order to submit the login form programmatically. If we view the source for a login web form like the form above in ASP.NET, we will see the following appear just after the opening form tag:

<input type="hidden" name="__VIEWSTATE"
 value="dDwtMzg4MDA0NzA7Oz5c3QucjNFeAIFsjceZk8ndLkr4yA==" /> 

You might be asking what else might appear in a form, and what is the easiest way to see what the browser sends to the server? If you are going to do any nontrivial screen-scraping, sooner or later you will need to answer this question and debug problems. The easiest way to debug is to use a tool like Fiddler, which will show you every request and response between your machine and a web server. You can inspect the headers and message content, and watch exactly what happens when your browser performs a POST, then try to replicate the behavior programmatically.

In order to send the correct Viewstate value to the server, we will first need to request the form from the server, parse the Viewstate, and then POST the form back. Let’s try this in our next example.

byte[] response;

WebClient webClient = new WebClient();
response = webClient.DownloadData(LOGIN_URL);

string viewstate = ExtractViewState(
      Encoding.ASCII.GetString(response)
   );

string postData = String.Format(
   "__VIEWSTATE={0}&UsernameTextBox={1}&PasswordTextBox={2}&LoginButton=Login",
   viewstate, USERNAME, PASSWORD);

webClient.Headers.Add("Content-Type", "application/x-www-form-urlencoded");
response = webClient.UploadData(
        LOGIN_URL, "POST", Encoding.ASCII.GetBytes(postData)
    );

Now we have a lot more activity happening. First, we request the login form, then we parse out the Viewstate value (more on this coming up). Once we have the Viewstate, we can create a string (postData) with the form values. We have not yet mentioned the reason for adding the Content-Type header: it tells the server that the message body contains URL-encoded form values, and the POST will not work without it. If you use the Fiddler tool, this is one of those small details you might notice as a difference between your programmatic POST and the browser POST.

We can parse out the Viewstate value with some string manipulation. First, we will find the location of the identifier __VIEWSTATE, then identify the string after the identifier and between the double quotes of the value attribute.

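private string ExtractViewState(string s)
{
   string viewStateNameDelimiter = "__VIEWSTATE";
   string valueDelimiter = "value=\"";

   // locate the hidden field by its name
   int viewStateNamePosition = s.IndexOf(viewStateNameDelimiter);
   if (viewStateNamePosition < 0)
   {
      throw new ArgumentException("No __VIEWSTATE field found.", "s");
   }

   // the value attribute follows the name attribute
   int viewStateValuePosition = s.IndexOf(
         valueDelimiter, viewStateNamePosition
      );

   int viewStateStartPosition = viewStateValuePosition + 
                                valueDelimiter.Length;
   int viewStateEndPosition = s.IndexOf("\"", viewStateStartPosition);

   // URL encode the value so it survives the trip in the POST body
   // (UrlEncodeUnicode is obsolete in later framework versions)
   return HttpUtility.UrlEncode(
            s.Substring(
               viewStateStartPosition, 
               viewStateEndPosition - viewStateStartPosition
            )
         );  
}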

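Notice the use of URL encoding so the server does not misinterpret characters with a special meaning in form data. The Viewstate value is Base64 encoded, so it can contain the plus sign (which a server reads as a space in form data) and the equal sign (which separates names from values).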
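The same extraction can be sketched with regular expressions, which avoids depending on the value attribute appearing after the name attribute. This is an alternative under stated assumptions, not the approach used in the sample above: ViewStateParser and ExtractEncodedViewState are hypothetical names, the attributes are assumed to use double quotes, and System.Net.WebUtility stands in for HttpUtility so the sketch has no dependency on System.Web.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class ViewStateParser
{
    // Hypothetical helper: returns the URL-encoded __VIEWSTATE value,
    // or null when the hidden field is not present in the markup.
    public static string ExtractEncodedViewState(string html)
    {
        // Find the whole <input ...> tag that carries name="__VIEWSTATE".
        Match tag = Regex.Match(
            html,
            "<input[^>]*name=\"__VIEWSTATE\"[^>]*>",
            RegexOptions.IgnoreCase);
        if (!tag.Success)
        {
            return null;
        }

        // Read the value attribute from that tag only.
        Match value = Regex.Match(
            tag.Value, "value=\"([^\"]*)\"", RegexOptions.IgnoreCase);
        if (!value.Success)
        {
            return null;
        }

        return WebUtility.UrlEncode(value.Groups[1].Value);
    }
}
```

Returning null instead of throwing lets the caller decide how to handle a page without Viewstate, such as a plain HTML login form.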

If you are familiar with forms authentication in ASP.NET you’ll know the runtime issues a cookie to the browser when a user has successfully authenticated themselves. On subsequent requests, the browser needs to pass along the cookie value to reach protected resources. Unfortunately, I have not found an easy way for the WebClient to work with cookie values, so we will try a more advanced API with the HttpWebRequest class.

HttpWebRequest

The code using HttpWebRequest will look a bit different than what we have seen with WebClient. HttpWebRequest uses streams to write form values into the request and read the response. We also need to add some code to handle the forms authentication cookie. This final code example will successfully log in to a website and pull the HTML from a protected resource.

private void Button5_Click(object sender, System.EventArgs e)
{
   // first, request the login form to get the viewstate value
   HttpWebRequest webRequest = WebRequest.Create(LOGIN_URL) as HttpWebRequest;         
   StreamReader responseReader = new StreamReader(
         webRequest.GetResponse().GetResponseStream()
      );
   string responseData = responseReader.ReadToEnd();         
   responseReader.Close();
   
   // extract the viewstate value and build out POST data
   string viewState = ExtractViewState(responseData);       
   string postData = 
         String.Format(
            "__VIEWSTATE={0}&UsernameTextBox={1}&PasswordTextBox={2}&LoginButton=Login",
            viewState, USERNAME, PASSWORD
         );
  
   // have a cookie container ready to receive the forms auth cookie
   CookieContainer cookies = new CookieContainer();

   // now post to the login form
   webRequest = WebRequest.Create(LOGIN_URL) as HttpWebRequest;
   webRequest.Method = "POST";
   webRequest.ContentType = "application/x-www-form-urlencoded";
   webRequest.CookieContainer = cookies;        
   
   // write the form values into the request message
   StreamWriter requestWriter = new StreamWriter(webRequest.GetRequestStream());
   requestWriter.Write(postData);
   requestWriter.Close();
   
   // we don't need the contents of the response, just the cookie it issues
   webRequest.GetResponse().Close();
   
   // now we can send our cookie along with a request for the protected page
   webRequest = WebRequest.Create(SECRET_PAGE_URL) as HttpWebRequest;
   webRequest.CookieContainer = cookies;
   responseReader = new StreamReader(webRequest.GetResponse().GetResponseStream());
   
   // and read the response
   responseData = responseReader.ReadToEnd();
   responseReader.Close();
   
   Response.Write(responseData);         
}

If you have been following along, the above code should make some sense, even though the HttpWebRequest class requires us to do a bit more work. For instance, instead of using the UploadData method of WebClient to POST and receive a response, we need to get the request stream, write into the request stream, get the response stream, and read from the response stream. Notice the use of the CookieContainer class to carry the authentication ticket from the login response into our request for the protected page.
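The four stream steps can be wrapped in a small helper so each stream and response is closed even when an exception occurs. The sketch below is one possible arrangement and not the code used above: FormPoster, WriteBody, ReadBody, and PostForm are hypothetical names, and it assumes a framework version in which WebResponse can be used in a using statement.

```csharp
using System;
using System.IO;
using System.Net;

public static class FormPoster
{
    // Writes the form values into the request stream and closes it.
    public static void WriteBody(Stream requestStream, string postData)
    {
        using (StreamWriter writer = new StreamWriter(requestStream))
        {
            writer.Write(postData);
        }
    }

    // Reads the whole response stream into a string and closes it.
    public static string ReadBody(Stream responseStream)
    {
        using (StreamReader reader = new StreamReader(responseStream))
        {
            return reader.ReadToEnd();
        }
    }

    // POSTs URL-encoded form data and returns the response body.
    // Cookies issued by the server are collected in the given container.
    public static string PostForm(
        string url, string postData, CookieContainer cookies)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "POST";
        request.ContentType = "application/x-www-form-urlencoded";
        request.CookieContainer = cookies;

        WriteBody(request.GetRequestStream(), postData);

        using (WebResponse response = request.GetResponse())
        {
            return ReadBody(response.GetResponseStream());
        }
    }
}
```

Keeping WriteBody and ReadBody separate from PostForm means the stream handling can be exercised with an in-memory stream, without a live web server.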

Conclusion

Until every single web site on the Internet offers a web service to programmatically retrieve data, screen scraping will be around. It’s good to know a few tricks to fetch content from the web using code and classes in the .NET framework.

By K. Scott Allen

Additional Resources

Understanding ASP.NET View State
Methods GET and POST in HTML forms - what's the difference?

Comments:

Not really working for me...
By bmatthews on 7/12/2004
I am trying to make a screen scraper for www.xanga.com so people can be notified of page updates on certain blogs, but I am fairly sure my problem lies in this section:

String.Format(
"__VIEWSTATE={0}&UsernameTextBox={1}&PasswordTextBox={2}&LoginButton=Login",
viewState, USERNAME, PASSWORD
);

I did move this over to VB.NET, but I doubt that should make a big difference. I simply took your methods, used the same classes, and changed the syntax. I simply don't believe I am logging in correctly. Any help you could provide would be much appreciated.

Screen Scraping
By JoeF on 7/16/2004
I was struggling with this exact problem this week! Your article for showing how to submit a request for a page that is "protected" by forms authentication was excellent! The code "just worked". The logic was also very well explained and as always the answer appears simple once you see it! You even covered the viewstate issue perfectly. I was able to use this code to perform the login and retrieve the cookie and then submit my real request with the new cookie. It worked!

I was testing httpCompression and I wanted to see it work on the pages in a protected area. So I submit two requests, one that asks for the uncompressed page and the other that asks for the compressed version. That part of the code was working. Getting around the Forms Authentication was the pill.

Thanks for the great article!

Joe

Screenscraping
By mumfie on 9/1/2004
Very useful article Scott.
I have been trying to do something similar to precompile protected pages to improve response times when they are first accessed.

Problem with image source
By aman_molder on 9/15/2004
Screen scraping is fairly simple, but I've been stuck on one problem. After I've downloaded the HTML, the image src is not complete but relative, i.e., it should look like:
<img src=http://www.site.com/images/logo.jpg>
but instead it looks like:
<img src=images/logo.jpg>

Any ideas as to how I can get the complete path instead of the relative path?

Thanks
- Aman

Server returned BAD REQUEST error
By LISA on 10/26/2004
I tried the above code but I keep getting this error on the following line:

webRequest.GetResponse().Close();

This kind of error is usually associated with a bad string format, but I double checked and there is nothing wrong with my postData string. Am I missing something?

Two URLs
By sherritp on 11/12/2004
The difference between the two URLs above (LOGIN_URL and SECRET_PAGE_URL) wasn't mentioned. Here's what I figured out:

LOGIN_URL is the initial page with the form.
SECRET_PAGE_URL is the desired page you want to scrape.

Copyright 2004 OdeToCode.com 

