Reputation: 8768
I am writing a web crawler in C#. Within the method that gets all of the links on a page, I want to return the list of links, but 'filter' it with LINQ so that the list only contains URLs that actually exist. I have a helper method called RemoteFileExists that returns a boolean. At the end of the method, I wrote the following LINQ query:
// Links is a List<string> that hasn't been filtered yet
return (from link in Links
        where RemoteFileExists(link)
        select link).ToList<string>();
For some reason, when I do this, the returned list is empty.
RemoteFileExists:
static bool RemoteFileExists(string url)
{
    try
    {
        HttpWebRequest request = WebRequest.Create(url) as HttpWebRequest;
        request.Method = "HEAD"; // only fetch the headers, not the body
        HttpWebResponse response = request.GetResponse() as HttpWebResponse;
        return (response.StatusCode == HttpStatusCode.OK);
    }
    catch
    {
        return false;
    }
}
Upvotes: 0
Views: 580
Reputation: 781
I have been using the RemoteFileExists method from the question in my own code. Sometimes the program hangs because the response is never closed, which ties up the connection. Right now I am using the following code:
static bool RemoteFileExists(string url)
{
    try
    {
        HttpWebRequest request = WebRequest.Create(url) as HttpWebRequest;
        request.Method = "HEAD";
        HttpWebResponse response = request.GetResponse() as HttpWebResponse;
        bool exists = (response.StatusCode == HttpStatusCode.OK);
        response.Close(); // release the connection so later requests don't hang
        return exists;
    }
    catch
    {
        return false;
    }
}
Also, the above code does not detect redirects. This matters for a crawler because you need to know when to advance to another page instead of following redirects back to the same page; a sketch of one way to handle this follows below.
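A minimal sketch of redirect detection, assuming the same HttpWebRequest approach (the name RemoteFileExistsNoRedirect and the exact status handling are illustrative, not from the original post). Turning off AllowAutoRedirect makes 3xx responses visible instead of being followed silently, and the using block closes the response automatically:
// requires: using System.Net;
static bool RemoteFileExistsNoRedirect(string url)
{
    try
    {
        HttpWebRequest request = WebRequest.Create(url) as HttpWebRequest;
        request.Method = "HEAD";
        request.AllowAutoRedirect = false; // surface 3xx responses instead of following them

        using (HttpWebResponse response = request.GetResponse() as HttpWebResponse)
        {
            int code = (int)response.StatusCode;
            if (code >= 300 && code < 400)
            {
                // The redirect target is in response.Headers["Location"];
                // a crawler could record it before moving on.
                return false;
            }
            return response.StatusCode == HttpStatusCode.OK;
        }
    }
    catch
    {
        return false;
    }
}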
Upvotes: 0
Reputation: 35353
I guess either your links are not correct or your sites don't support HEAD. Since this code works:
List<string> Links = new List<string>() { "http://www.google.com" };
var res = (from link in Links
           where RemoteFileExists(link)
           select link).ToList<string>();
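If unsupported HEAD is the problem, one possible workaround (a sketch under that assumption, not part of the original answer; the name RemoteFileExistsWithFallback is made up here) is to retry with GET when a server answers HEAD with 405 Method Not Allowed:
// requires: using System.Net;
static bool RemoteFileExistsWithFallback(string url)
{
    foreach (string method in new[] { "HEAD", "GET" })
    {
        try
        {
            HttpWebRequest request = WebRequest.Create(url) as HttpWebRequest;
            request.Method = method;
            using (HttpWebResponse response = request.GetResponse() as HttpWebResponse)
            {
                return response.StatusCode == HttpStatusCode.OK;
            }
        }
        catch (WebException ex)
        {
            HttpWebResponse errorResponse = ex.Response as HttpWebResponse;
            bool headRejected = method == "HEAD"
                && errorResponse != null
                && errorResponse.StatusCode == HttpStatusCode.MethodNotAllowed;
            if (!headRejected)
                return false; // a real failure, not just a HEAD restriction
            // otherwise fall through and retry the loop with GET
        }
        catch
        {
            return false;
        }
    }
    return false;
}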
Upvotes: 3