dotnet-practitioner

Reputation: 14148

C#: crawler project

Could I get some easy-to-follow code examples for the following:

  1. Use a browser control to launch a request to a target website.
  2. Capture the response from the target website.
  3. Convert the response into a DOM object.
  4. Iterate through the DOM object and capture fields like "FirstName" and "LastName" if they are part of the response.
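
For orientation, here is a minimal sketch of those four steps using the WinForms WebBrowser control; the URL and the FirstName/LastName field names are placeholders:

using System;
using System.Windows.Forms;

static class BrowserCrawlerSketch
{
    [STAThread]
    static void Main()
    {
        var browser = new WebBrowser { ScriptErrorsSuppressed = true };

        // (2) Capture the response once navigation finishes.
        browser.DocumentCompleted += (sender, e) =>
        {
            // (3) The control has already parsed the response into a DOM.
            HtmlDocument dom = browser.Document;

            // (4) Walk the DOM and pick out named fields.
            foreach (HtmlElement input in dom.GetElementsByTagName("input"))
            {
                string name = input.GetAttribute("name");
                if (name == "FirstName" || name == "LastName")
                    Console.WriteLine(name + " = " + input.GetAttribute("value"));
            }
            Application.ExitThread();
        };

        // (1) Launch the request to the target website (placeholder URL).
        browser.Navigate("http://example.com/people");

        Application.Run(); // message loop so the control can navigate
    }
}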

thanks

Upvotes: 2

Views: 1407

Answers (5)

KallDrexx

Reputation: 27803

If you want a pure C# way to traverse web pages, a good place to look is WatiN. It allows you to easily open a web browser and go through the web page (and actions) via C# code.

Here's an example that searches Google with the API (taken from their docs):

using System;
using WatiN.Core;

namespace WatiNGettingStarted
{
  class WatiNConsoleExample
  {
    [STAThread]
    static void Main(string[] args)
    {
      // Open a new Internet Explorer window and
      // goto the google website.
      IE ie = new IE("http://www.google.com");

      // Find the search text field and type Watin in it.
      ie.TextField(Find.ByName("q")).TypeText("WatiN");

      // Click the Google search button.
      ie.Button(Find.ByValue("Google Search")).Click();

      // Uncomment the following line if you want to close
      // Internet Explorer and the console window immediately.
      //ie.Close();
    }
  }
}

Upvotes: 1

Darin Dimitrov

Reputation: 1038790

You may take a look at Html Agility Pack and/or SgmlReader. Here's an example using SgmlReader which selects all the nodes in the DOM containing some text:

using System;
using System.Xml;
using Sgml;

class Program
{
    static void Main()
    {
        using (var reader = new SgmlReader())
        {
            reader.Href = "http://www.microsoft.com";
            var doc = new XmlDocument();
            doc.Load(reader);
            var nodes = doc.SelectNodes("//*[contains(text(), 'Products')]");
            foreach (XmlNode node in nodes)
            {
                Console.WriteLine(node.OuterXml);
            }
        }
    }
}
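
For comparison, a similar sketch with Html Agility Pack (assuming the HtmlAgilityPack package; same URL and XPath as the SgmlReader example above):

using System;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        // HtmlWeb downloads and parses the page in one step.
        var web = new HtmlWeb();
        HtmlDocument doc = web.Load("http://www.microsoft.com");

        // Same idea: select every node whose text contains 'Products'.
        var nodes = doc.DocumentNode.SelectNodes("//*[contains(text(), 'Products')]");
        if (nodes != null)
        {
            foreach (HtmlNode node in nodes)
            {
                Console.WriteLine(node.OuterHtml);
            }
        }
    }
}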

Upvotes: 1

Mohamad Alhamoud

Reputation: 4929

Here you can find a four-part tutorial on what you want.

This is the first part; all four parts are here (How to Write a Search Engine).

Upvotes: 1

Tim M.

Reputation: 54377

Here is code that uses a WebRequest object to retrieve data and captures the response as a stream.

    public static Stream GetExternalData( string url, string postData, int timeout )
    {
        ServicePointManager.ServerCertificateValidationCallback += delegate( object sender,
                                                                                X509Certificate certificate,
                                                                                X509Chain chain,
                                                                                SslPolicyErrors sslPolicyErrors )
        {
            // if we trust the callee implicitly, return true...otherwise, perform validation logic
            return true;
        };

        WebRequest request = null;
        HttpWebResponse response = null;

        try
        {
            request = WebRequest.Create( url );
            request.Timeout = timeout; // force a quick timeout

            if( postData != null )
            {
                request.Method = "POST";
                request.ContentType = "application/x-www-form-urlencoded";
                request.ContentLength = postData.Length;

                using( StreamWriter requestStream = new StreamWriter( request.GetRequestStream(), System.Text.Encoding.ASCII ) )
                {
                    requestStream.Write( postData );
                    requestStream.Close();
                }
            }

            response = (HttpWebResponse)request.GetResponse();
        }
        catch( WebException ex )
        {
            Log.LogException( ex );
        }
        finally
        {
            request = null;
        }

        if( response == null || response.StatusCode != HttpStatusCode.OK )
        {
            if( response != null )
            {
                response.Close();
                response = null;
            }

            return null;
        }

        return response.GetResponseStream();
    }

For managing the response, I have a custom XHTML parser that I use, but it is thousands of lines of code. There are several publicly available parsers (see Darin's answer).
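
For illustration, a sketch of how the returned stream might be fed to one of those parsers (Html Agility Pack here) to pull out fields such as "FirstName"/"LastName"; the URL, timeout, and field names are assumptions:

    // Hypothetical usage of GetExternalData: load the stream into Html Agility Pack
    // and read out input fields named "FirstName"/"LastName".
    using (Stream stream = GetExternalData("http://example.com/people", null, 5000))
    {
        if (stream == null)
            return;

        var doc = new HtmlAgilityPack.HtmlDocument();
        doc.Load(stream);

        // SelectNodes returns null when nothing matches, so guard before iterating.
        var inputs = doc.DocumentNode.SelectNodes("//input[@name='FirstName' or @name='LastName']");
        if (inputs != null)
        {
            foreach (var input in inputs)
            {
                Console.WriteLine("{0} = {1}",
                    input.GetAttributeValue("name", ""),
                    input.GetAttributeValue("value", ""));
            }
        }
    }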

EDIT: per the OP's question, headers can be added to the request to emulate a user agent. For example:

request = (HttpWebRequest)WebRequest.Create( url );
request.Accept = "application/x-ms-application, image/jpeg, application/xaml+xml, image/gif, image/pjpeg, application/x-ms-xbap, application/x-shockwave-flash, */*";
request.Timeout = timeout;
request.Headers.Add( "Cookie", cookies );

// manifest as a standard user agent
request.UserAgent = "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US)";

Upvotes: 2

Chris Kooken

Reputation: 33870

You could also use Selenium to easily traverse the DOM and grab the values of fields. It will also open the browser for you automatically.
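
For example, a minimal sketch with Selenium WebDriver's .NET bindings (the URL and field names are placeholders):

using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Firefox;

class SeleniumExample
{
    static void Main()
    {
        // Opens a real Firefox window and navigates to the (hypothetical) target page.
        IWebDriver driver = new FirefoxDriver();
        driver.Navigate().GoToUrl("http://example.com/people");

        // Grab the values of the FirstName/LastName fields, if present.
        foreach (string name in new[] { "FirstName", "LastName" })
        {
            var fields = driver.FindElements(By.Name(name));
            if (fields.Count > 0)
                Console.WriteLine(name + " = " + fields[0].GetAttribute("value"));
        }

        driver.Quit();
    }
}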

Upvotes: 0
