Friday, April 26, 2013

WebRequest URL not returning the correct webpage source if proxynull isn't used as part of the URL query string (Web Scraping)

On one of the sites I'm crawling, I ran into a case where the site expects a query string parameter such as proxynull=90B69303-3A61-4482-AF0725FDA1DAE548 appended to the URL, like this: http://samplesite/bin/jobs_list.cfm?proxynull=90B69303-3A61-4482-AF0725FDA1DAE548

I wondered whether I could just send the POST data and use the URL without the proxynull query string, like this: http://samplesite/bin/jobs_list.cfm, to scrape the website.

After a series of experiments, I found that the solution is to set the proxy of the web request object to the default proxy, as in the code below. With that in place, the URL without proxynull (http://samplesite/bin/jobs_list.cfm) returns the correct page source.
Code:
((HttpWebRequest)webRequest).Proxy = WebRequest.DefaultWebProxy;
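
For context, here is a minimal sketch of how the proxy assignment might fit into a complete POST request. The class name JobListScraper, the method names, and the content type are my own assumptions for illustration and not the original crawler code; the POST fields depend on the site being scraped.

```csharp
using System.IO;
using System.Net;
using System.Text;

public static class JobListScraper
{
    // Builds the request without sending it.
    // The URL is passed in without the proxynull query string.
    public static HttpWebRequest BuildRequest(string url)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "POST";
        // Assumed content type for ordinary form posts.
        request.ContentType = "application/x-www-form-urlencoded";
        // The fix described above: use the default proxy.
        request.Proxy = WebRequest.DefaultWebProxy;
        return request;
    }

    // Sends the POST data and returns the page source as a string.
    public static string Fetch(string url, string postData)
    {
        HttpWebRequest request = BuildRequest(url);
        byte[] body = Encoding.UTF8.GetBytes(postData);
        request.ContentLength = body.Length;

        using (Stream stream = request.GetRequestStream())
        {
            stream.Write(body, 0, body.Length);
        }

        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }
}
```

Calling JobListScraper.Fetch("http://samplesite/bin/jobs_list.cfm", postData) with whatever form fields the site expects would then return the page source.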

Cheers!
Greg

Monday, April 15, 2013

Cannot find JavaScriptSerializer in .NET 4.0 (REPOST)

These are the steps for using JavaScriptSerializer in .NET 4.0:

1. Create a new console application.
2. Change the target framework to .NET Framework 4 instead of .NET Framework 4 Client Profile.
3. Add a reference to System.Web.Extensions (4.0).
4. JavaScriptSerializer (namespace System.Web.Script.Serialization) is now accessible in Program.cs :-)
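
Once the reference is in place, a quick round trip like the sketch below can confirm that the class resolves. The JsonDemo class and its method names are hypothetical and only for illustration; the code assumes the full .NET Framework 4 with System.Web.Extensions referenced, as in the steps above.

```csharp
using System.Collections.Generic;
using System.Web.Script.Serialization; // lives in System.Web.Extensions.dll

public static class JsonDemo
{
    // Serializes a dictionary to a JSON string.
    public static string ToJson(Dictionary<string, object> data)
    {
        return new JavaScriptSerializer().Serialize(data);
    }

    // Parses a JSON object back into a dictionary.
    public static Dictionary<string, object> FromJson(string json)
    {
        return new JavaScriptSerializer()
            .Deserialize<Dictionary<string, object>>(json);
    }
}
```

Calling JsonDemo.ToJson from Main in Program.cs and feeding the result to JsonDemo.FromJson should give back the same keys and values.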

Source: Cannot Find Javascript Serializer in .NET 4.0

Thursday, April 11, 2013

Remove HTML tags in an XML String Document using REGEX (C#)

Here's a regex pattern that matches the HTML tags present in an XML string document while leaving the XML tags themselves alone. In the pattern, xml, node1, node2, node3, node4, node5, node6 and node7 are the XML tags to keep. They are placeholders: node1 could stand for a real XML tag name such as employername.
Code:
 // Keep the pattern on one line: whitespace inside a verbatim string becomes part of the regex.
 xmlStringData = Regex.Replace(xmlStringData,
     @"<((\??)|(/?)|(!))\s?\b(?!(\b(xml|node1|node2|node3|node4|node5|node6|node7)\b))\b[^>]*((/?))>",
     " ", RegexOptions.IgnoreCase);
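
To see the pattern at work on a small input, here is a self-contained sketch. The HtmlTagStripper class and Strip method are names I made up for this example, and the list of XML tags to keep is shortened to xml, node1 and node2.

```csharp
using System.Text.RegularExpressions;

public static class HtmlTagStripper
{
    // xml, node1 and node2 are placeholder names for the XML tags to keep.
    private const string Pattern =
        @"<((\??)|(/?)|(!))\s?\b(?!(\b(xml|node1|node2)\b))\b[^>]*((/?))>";

    // Replaces every tag that is not in the keep list with a single space.
    public static string Strip(string xmlStringData)
    {
        return Regex.Replace(xmlStringData, Pattern, " ",
            RegexOptions.IgnoreCase);
    }
}
```

For example, HtmlTagStripper.Strip("<node1>Hello <b>world</b><br/></node1>") keeps the node1 element and replaces the b and br tags with spaces.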

Note: This is only suitable for small XML files. Applying this pattern to large XML files can cause an out-of-memory exception.
Greg