Thursday, June 12, 2008

Screen scraping using .NET

The following article was quite helpful in performing screen scraping.

Highlights are as follows:
  • Before you can start screen scraping, you need to know what the browser retrieves and what it posts. Fiddler is a free tool that lets you inspect this traffic.
  • If the website is an ASP.NET application, you need to retrieve the view state (the hidden __VIEWSTATE form field) and post it back.
  • Some values can be kept constant:
    • The user agent, which tells the web server which browser you are. In my case I use "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; Maxthon; .NET CLR 1.1.4322; .NET CLR 2.0.50727)".
    • The content type. In my case I use "application/x-www-form-urlencoded".
  • Most sites use cookies, so you need to retrieve the cookies and send them back with each request (since web applications are typically stateless).
Here are some sample snippets:

' Namespaces used by the snippets below
Imports System.IO
Imports System.Net

Const contentType As String = "application/x-www-form-urlencoded"
Const usrAgent As String = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; Maxthon; .NET CLR 1.1.4322; .NET CLR 2.0.50727)"

Dim cookieCntr As New CookieContainer
Dim cookieColl As New CookieCollection
Dim cCache As New CredentialCache
Dim vCookie As Cookie
Dim rqst As HttpWebRequest
Dim rsp As HttpWebResponse
Dim sReader As StreamReader
Dim sWriter As StreamWriter
Dim postData As String
Dim rspStr As String
Dim url As String

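'=========== First request (a GET that collects the cookies)

rqst = DirectCast(WebRequest.Create(url), HttpWebRequest)

' Attach the container so that cookies set by the server are stored in it
rqst.CookieContainer = cookieCntr
'KeepAlive Setting
rqst.KeepAlive = False
rqst.AllowAutoRedirect = True
rqst.UserAgent = usrAgent
rqst.Credentials = cCache
rsp = DirectCast(rqst.GetResponse(), HttpWebResponse)

sReader = New StreamReader(rsp.GetResponseStream())
rspStr = sReader.ReadToEnd()
sReader.Close()
rsp.Close()
' cookieCntr now holds the cookies from this response, and cCache is reused as-is below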


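'================ Second request (the POST)

rqst = DirectCast(WebRequest.Create(url), HttpWebRequest)
rqst.AllowAutoRedirect = True
rqst.CookieContainer = cookieCntr
rqst.ContentType = contentType
rqst.Credentials = cCache
rqst.UserAgent = usrAgent
rqst.Method = "POST"

' Assign the postData information here

' Write the form data into the request body, then close the writer to flush it
sWriter = New StreamWriter(rqst.GetRequestStream())
sWriter.Write(postData)
sWriter.Close()

rsp = DirectCast(rqst.GetResponse(), HttpWebResponse)
' Capture the cookies returned by this response
cookieColl = rsp.Cookies
sReader = New StreamReader(rsp.GetResponseStream())
rspStr = sReader.ReadToEnd()
sReader.Close()
rsp.Close()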
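
The second request leaves postData unassigned. As a minimal sketch, assuming the target is an ASP.NET page that renders its view state as a hidden __VIEWSTATE input with the name attribute before the value attribute, the string could be built from the first response like this. GetHiddenField, BuildPostData and the txtSearch field are hypothetical names for illustration. The real field names are whatever Fiddler shows the browser posting.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module ViewStateSketch

    ' Hypothetical helper: returns the value of a hidden input field, or "" if it is not found.
    ' Assumes the name attribute appears before the value attribute in the tag.
    Function GetHiddenField(ByVal html As String, ByVal fieldName As String) As String
        Dim pattern As String = "<input[^>]*name=""" & Regex.Escape(fieldName) & """[^>]*value=""([^""]*)"""
        Dim m As Match = Regex.Match(html, pattern, RegexOptions.IgnoreCase)
        If m.Success Then
            Return m.Groups(1).Value
        End If
        Return ""
    End Function

    ' Hypothetical helper: builds a form-urlencoded body from the view state plus one form field.
    Function BuildPostData(ByVal html As String) As String
        Dim viewState As String = GetHiddenField(html, "__VIEWSTATE")
        ' "txtSearch" and "some value" are placeholders, not fields of any real site
        Return "__VIEWSTATE=" & Uri.EscapeDataString(viewState) & _
               "&txtSearch=" & Uri.EscapeDataString("some value")
    End Function

    Sub Main()
        ' Stand-in for the HTML held in rspStr after the first request
        Dim sampleHtml As String = "<input type=""hidden"" name=""__VIEWSTATE"" id=""__VIEWSTATE"" value=""dDwtMTIz/w=="" />"
        Console.WriteLine(BuildPostData(sampleHtml))
    End Sub

End Module
```

With the snippets above, this would be called as postData = BuildPostData(rspStr) before the second request is sent. Other hidden fields the page posts back, such as __EVENTVALIDATION, can be read the same way. Uri.EscapeDataString may reject very long strings on the .NET Framework, so for a large view state HttpUtility.UrlEncode (in System.Web) may be a better fit.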


Notice that all the credentials and cookies collected during the first request need to be sent back in each subsequent request.

This is especially true if the website requires a login. The example above processes only strings. Images and other objects need to be processed separately. If you use Fiddler, you will see that the browser opens a separate stream to fetch each object from the server: if a page has 10 images, you will see 10 additional streams. That is also one reason many sites use CAPTCHA to prevent automated screen scraping.
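
Fetching one of those additional objects might look roughly like the following. It is a sketch only: DownloadBytes is a hypothetical helper, and it assumes the image URL has already been taken from the page's HTML and that the server expects the same session cookies as the page itself.

```vbnet
Imports System
Imports System.IO
Imports System.Net

Module ImageFetchSketch

    ' Hypothetical helper: downloads one binary resource, reusing the session cookies.
    Function DownloadBytes(ByVal resourceUrl As String, ByVal cookies As CookieContainer) As Byte()
        Dim rqst As HttpWebRequest = DirectCast(WebRequest.Create(resourceUrl), HttpWebRequest)
        rqst.CookieContainer = cookies

        Using rsp As HttpWebResponse = DirectCast(rqst.GetResponse(), HttpWebResponse)
            Using src As Stream = rsp.GetResponseStream()
                Using buffer As New MemoryStream()
                    ' Copy the response in 4 KB chunks; a StreamReader would corrupt binary data
                    Dim chunk(4095) As Byte
                    Dim read As Integer = src.Read(chunk, 0, chunk.Length)
                    While read > 0
                        buffer.Write(chunk, 0, read)
                        read = src.Read(chunk, 0, chunk.Length)
                    End While
                    Return buffer.ToArray()
                End Using
            End Using
        End Using
    End Function

End Module
```

A call such as File.WriteAllBytes("image.gif", DownloadBytes(imageUrl, cookieCntr)) would then save one image, where imageUrl is a placeholder for an address found in the page and cookieCntr is the container from the snippets above.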


bdobur said...

Thank you for the information.
The article link in the first line is not working anymore, by the way.

Strovek said...

I just tried the link and can still get to the article.

Hmmm... I wonder if it was just a glitch. Anyway, that was the reason why I copied the highlights so that there is enough info left in case the original article was deleted.

Thanks for highlighting and visiting.

By the way, new methods introduced in the newer framework make some of this much easier to do, but I believe most of the essentials are covered here.