Programmer's Journal: Screen scraping using .NET

The following article was quite helpful in performing screen scraping.

Highlights are as follows:

Before you can start screen scraping, you need to know what the browser retrieves and what it posts. Fiddler is a free tool that lets you see this traffic.
If the website is an ASP.NET application, you need to retrieve the view state (the hidden __VIEWSTATE form field) and send it back with each post.
Some values can be treated as constants:

User agent, which tells the web server what browser you are. In my case I use "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; Maxthon; .NET CLR 1.1.4322; .NET CLR 2.0.50727)".
Content type. In my case I use "application/x-www-form-urlencoded".

Most sites use cookies, so you need to retrieve the cookies and send them back with each request (since web applications are typically stateless).
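
As a rough sketch of the view state step, the hidden field can be pulled out of the response HTML with a regular expression. The function name GetHiddenField is my own, and the pattern assumes the name attribute comes before the value attribute, which is how ASP.NET normally renders the field; a real HTML parser would be more robust. WebUtility.HtmlDecode needs .NET 4.0 or later.

```vbnet
Imports System
Imports System.Net
Imports System.Text.RegularExpressions

Module ViewStateSketch
    ' Hypothetical helper: returns the value of a hidden <input> with the
    ' given name, or Nothing when the field is not present in the HTML.
    ' Assumes name="..." appears before value="..." inside the tag.
    Function GetHiddenField(ByVal html As String, ByVal fieldName As String) As String
        Dim pattern As String = "<input[^>]*name=""" & Regex.Escape(fieldName) &
                                """[^>]*value=""([^""]*)"""
        Dim m As Match = Regex.Match(html, pattern, RegexOptions.IgnoreCase)
        If Not m.Success Then Return Nothing
        ' Attribute values are HTML-encoded in the page source
        Return WebUtility.HtmlDecode(m.Groups(1).Value)
    End Function
End Module
```

Calling GetHiddenField(rspStr, "__VIEWSTATE") on the text of the first response would give the value to include in the next post.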

Here are some sample snippets:


    ' Needed at the top of the file
    Imports System.IO
    Imports System.Net

    Const contentType As String = "application/x-www-form-urlencoded"
    Const usrAgent As String = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; Maxthon; .NET CLR 1.1.4322; .NET CLR 2.0.50727)"

    ' The same cookie container and credential cache are reused for every request
    Dim cookieCntr As New CookieContainer
    Dim cookieColl As CookieCollection
    Dim cCache As New CredentialCache
    Dim rqst As HttpWebRequest
    Dim rsp As HttpWebResponse
    Dim sReader As StreamReader
    Dim sWriter As StreamWriter
    Dim postData As String
    Dim rspStr As String
    Dim url As String   ' Assign the page address here

    '=========== First request (a plain GET that collects the cookies)

    rqst = DirectCast(WebRequest.Create(url), HttpWebRequest)
    rqst.CookieContainer = cookieCntr
    'KeepAlive Setting
    rqst.KeepAlive = False
    rqst.AllowAutoRedirect = True
    rqst.UserAgent = usrAgent
    rqst.Credentials = cCache
    rsp = DirectCast(rqst.GetResponse(), HttpWebResponse)

    ' Cookies set by the server are added to cookieCntr automatically,
    ' so there is nothing to copy back from the request
    sReader = New StreamReader(rsp.GetResponseStream())
    rspStr = sReader.ReadToEnd()
    sReader.Close()

    '=========== Second request (the POST)

    rqst = DirectCast(WebRequest.Create(url), HttpWebRequest)
    rqst.AllowAutoRedirect = True
    rqst.CookieContainer = cookieCntr
    rqst.ContentType = contentType
    rqst.Credentials = cCache
    rqst.UserAgent = usrAgent
    rqst.Method = "POST"

    '  Assign the postData information here

    sWriter = New StreamWriter(rqst.GetRequestStream())
    sWriter.Write(postData)
    sWriter.Close()

    rsp = DirectCast(rqst.GetResponse(), HttpWebResponse)
    ' Read the cookies returned by this response (the original assignment was reversed)
    cookieColl = rsp.Cookies
    sReader = New StreamReader(rsp.GetResponseStream())
    rspStr = sReader.ReadToEnd()
    sReader.Close()
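
The postData placeholder in the second request has to match the content type, that is, a list of name=value pairs joined with "&" and URL-encoded. A minimal sketch of building it is shown here; BuildPostData and the field names in the usage note are my own illustrations, and the real field names must come from what Fiddler shows the browser posting.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text

Module PostDataSketch
    ' Hypothetical helper: joins name/value pairs into an
    ' application/x-www-form-urlencoded body, escaping both sides.
    Function BuildPostData(ByVal fields As IEnumerable(Of KeyValuePair(Of String, String))) As String
        Dim sb As New StringBuilder()
        For Each field As KeyValuePair(Of String, String) In fields
            If sb.Length > 0 Then sb.Append("&")
            sb.Append(Uri.EscapeDataString(field.Key))
            sb.Append("=")
            sb.Append(Uri.EscapeDataString(field.Value))
        Next
        Return sb.ToString()
    End Function
End Module
```

For example, a list holding ("__VIEWSTATE", "a+b/c=") and ("txtUser", "john doe") produces "__VIEWSTATE=a%2Bb%2Fc%3D&txtUser=john%20doe". Escaping matters for view state in particular, because its Base64 text can contain "+", "/" and "=".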

Notice that the credentials and cookies collected during the first request need to be sent back in every subsequent post.

This is especially true if the web site requires a login. In the example above, what we process are just strings. Images and other objects need to be processed separately. If you use Fiddler, you will see that each object is fetched from the server in its own stream: if a page has 10 images, you will get 10 additional streams. That is also one reason many sites use CAPTCHA images to prevent automated screen scraping.
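
To illustrate fetching one of those separate objects, here is a small sketch that downloads a binary resource such as an image while sending the same cookies. The names ReadAllBytes and DownloadBytes are my own, and the resource address is whatever URL Fiddler shows for the image. On recent .NET versions WebRequest.Create still works but is flagged as obsolete.

```vbnet
Imports System
Imports System.IO
Imports System.Net

Module ImageDownloadSketch
    ' Hypothetical helper: copies a stream into a byte array in 4 KB chunks.
    ' A StreamReader is not suitable here because image data is not text.
    Function ReadAllBytes(ByVal source As Stream) As Byte()
        Dim chunk(4095) As Byte
        Using ms As New MemoryStream()
            Dim count As Integer = source.Read(chunk, 0, chunk.Length)
            While count > 0
                ms.Write(chunk, 0, count)
                count = source.Read(chunk, 0, chunk.Length)
            End While
            Return ms.ToArray()
        End Using
    End Function

    ' Hypothetical helper: requests one resource with the session cookies
    ' attached and returns its raw bytes.
    Function DownloadBytes(ByVal resourceUrl As String,
                           ByVal cookies As CookieContainer,
                           ByVal agent As String) As Byte()
        Dim request As HttpWebRequest = DirectCast(WebRequest.Create(resourceUrl), HttpWebRequest)
        request.CookieContainer = cookies
        request.UserAgent = agent
        Using response As HttpWebResponse = DirectCast(request.GetResponse(), HttpWebResponse)
            Using body As Stream = response.GetResponseStream()
                Return ReadAllBytes(body)
            End Using
        End Using
    End Function
End Module
```

The returned bytes can then be saved with File.WriteAllBytes or loaded into an image object, one call per image on the page.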

2 comments:

bdobur said...: Thank you for the information.
The article link in the first line is not working anymore, by the way.
Cheers.; April 13, 2011 at 10:49 PM
Strovek said...: I just tried the link and can still get to the article.

Hmmm... I wonder if it was just a glitch. Anyway, that was the reason why I copied the highlights so that there is enough info left in case the original article was deleted.

Thanks for highlighting and visiting.

By the way, newer versions of the framework have introduced methods that make some of this much easier to do, but I believe most of the essentials are covered here.; April 18, 2011 at 5:16 PM


Thursday, June 12, 2008
