Wilson

Reputation: 8768

Using LINQ to filter a List<string> without changing the variable type

I am writing a web crawler in C#. In the method that collects all of the links on a page, I want to return the list of links, but filter it with LINQ so that it only contains URLs that exist. I have a helper method, RemoteFileExists, that returns a boolean value. At the end of the method, I wrote the following LINQ query:

// Links is a List<string> that hasn't been filtered
return (from link in Links
        where RemoteFileExists(link)
        select link).ToList<string>();

For some reason, when I do this, the List is returned empty.

RemoteFileExists:

static bool RemoteFileExists(string url)
{
    try
    {
        HttpWebRequest request = WebRequest.Create(url) as HttpWebRequest;
        request.Method = "HEAD";
        HttpWebResponse response = request.GetResponse() as HttpWebResponse;
        return (response.StatusCode == HttpStatusCode.OK);
    }
    catch
    {
        // Every failure (malformed URL, network error, error status) ends up here
        return false;
    }
}

Upvotes: 0

Views: 580

Answers (2)

MatrixRonny

Reputation: 781

I have been using the RemoteFileExists method in my code. Sometimes the program hangs because the response is never closed. I now use the following code:

static bool RemoteFileExists(string url)
{
  try
  {
    HttpWebRequest request = WebRequest.Create(url) as HttpWebRequest;
    request.Method = "HEAD";
    // The using block closes the response once the status has been read
    using (HttpWebResponse response = request.GetResponse() as HttpWebResponse)
    {
      return (response.StatusCode == HttpStatusCode.OK);
    }
  }
  catch
  {
    return false;
  }
}

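Also, the above code does not detect redirects: HttpWebRequest follows them automatically by default, so only the status of the final page is reported. This matters for crawlers because you need to know when a URL actually leads to another page, instead of silently following redirects to a page you have already visited.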
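One possible way to surface redirects is to switch off automatic redirect handling and inspect the status code yourself. The following is a minimal sketch rather than a drop-in replacement: the RedirectProbe class and its IsRedirect and Probe methods are hypothetical names introduced here, and treating every 3xx status as a redirect is an assumption.

```csharp
using System;
using System.Net;

// Hypothetical helper, not part of the original answer.
static class RedirectProbe
{
    // Assumption: any 3xx status code counts as a redirect.
    public static bool IsRedirect(HttpStatusCode status)
    {
        int code = (int)status;
        return code >= 300 && code < 400;
    }

    // Sends a HEAD request without following redirects.
    // Returns the status code, or null if no HTTP response was received.
    // redirectTarget receives the Location header when the status is 3xx.
    public static HttpStatusCode? Probe(string url, out string redirectTarget)
    {
        redirectTarget = null;
        try
        {
            var request = (HttpWebRequest)WebRequest.Create(url);
            request.Method = "HEAD";
            request.AllowAutoRedirect = false; // report the 3xx instead of following it

            using (var response = (HttpWebResponse)request.GetResponse())
            {
                if (IsRedirect(response.StatusCode))
                    redirectTarget = response.Headers["Location"];
                return response.StatusCode;
            }
        }
        catch (WebException ex)
        {
            // Error statuses arrive as WebException; the response, if any, still carries the code.
            using (var failed = ex.Response as HttpWebResponse)
            {
                return failed != null ? failed.StatusCode : (HttpStatusCode?)null;
            }
        }
        catch (Exception)
        {
            // Malformed URL, non-HTTP scheme, and similar problems.
            return null;
        }
    }
}
```

A crawler could then compare redirectTarget against the pages it has already visited before deciding whether to queue it.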

Upvotes: 0

I4V

Reputation: 35353

I guess either your links are not correct or the sites don't support HEAD, since this code works:

List<string> Links = new List<string>() {"http://www.google.com"};
var res = ( from link in Links
            where RemoteFileExists(link)
            select link).ToList<string>();
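To test the first guess, it may help to look at the links themselves before any request is made. Links scraped from a page are often relative (for example about.html) or use a non-HTTP scheme such as mailto:, and RemoteFileExists reports both as missing because its catch block hides the failure. The sketch below resolves each link against the page it came from and keeps only http/https URLs. The LinkNormalizer class, the ToAbsoluteHttpLinks method and the example.com URL are hypothetical and not from the question.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the original question.
static class LinkNormalizer
{
    // Resolves each raw link against the URL of the page it was found on
    // and keeps only absolute http/https URLs.
    public static List<string> ToAbsoluteHttpLinks(string pageUrl, IEnumerable<string> rawLinks)
    {
        var baseUri = new Uri(pageUrl);
        var result = new List<string>();

        foreach (var raw in rawLinks)
        {
            Uri absolute;
            // TryCreate returns false instead of throwing on malformed input.
            if (Uri.TryCreate(baseUri, raw, out absolute) &&
                (absolute.Scheme == Uri.UriSchemeHttp || absolute.Scheme == Uri.UriSchemeHttps))
            {
                result.Add(absolute.AbsoluteUri);
            }
        }

        return result;
    }
}
```

The normalized list can then go through the same where RemoteFileExists(link) query. If the result is still empty, the HEAD guess becomes the more likely explanation.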

Upvotes: 3
