Ebikeneser

Reputation: 2364

Retrieve HTML from links on page

I am using the following method to retrieve the source code from my website:

// Requires System.Net and System.IO.
class WorkerClass1
{
    public static string getSourceCode(string url)
    {
        // Download the page and return its raw HTML.
        HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);
        using (HttpWebResponse resp = (HttpWebResponse)req.GetResponse())
        using (StreamReader sr = new StreamReader(resp.GetResponseStream()))
        {
            return sr.ReadToEnd();
        }
    }
}

And then I use WorkerClass1 like so:

private void button1_Click(object sender, EventArgs e)
{
    string url = textBox1.Text;
    string sourceCode = WorkerClass1.getSourceCode(url);

    // Write the downloaded HTML to disk.
    using (StreamWriter sw = new StreamWriter(@"path"))
    {
        sw.Write(sourceCode);
    }
}

This works great and retrieves the HTML from my home page. However, there are links at the bottom of the page which I want to follow once the first page has been retrieved.

Is there a way I could modify my current code to do this?

Upvotes: 0

Views: 112

Answers (1)

Lars Holdgaard

Reputation: 9966

Yes of course.

What I would do is scan the HTML with a regular expression that looks for links. For each match, I would put the link in a queue or a similar data structure, and then call the same download method on each queued URL.
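A rough sketch of that idea, reusing the `getSourceCode` method from your question. The regex, the queue, and the `StartsWith("http")` filter here are illustrative choices, not part of your original code, and a simple `href` regex like this will miss or mangle unusual markup:

    // Requires System.Collections.Generic and System.Text.RegularExpressions.
    var queue = new Queue<string>();
    var visited = new HashSet<string>();
    queue.Enqueue(textBox1.Text);          // start from the home page URL

    while (queue.Count > 0)
    {
        string current = queue.Dequeue();
        if (!visited.Add(current))
            continue;                      // skip pages we have already fetched

        string html = WorkerClass1.getSourceCode(current);
        // ... save html to disk here, as in button1_Click ...

        // Pull href values out of the markup and queue them for the next pass.
        foreach (Match m in Regex.Matches(html, "href\\s*=\\s*\"(?<url>[^\"]+)\"",
                                          RegexOptions.IgnoreCase))
        {
            string link = m.Groups["url"].Value;
            if (link.StartsWith("http", StringComparison.OrdinalIgnoreCase))
                queue.Enqueue(link);       // only follow absolute links in this sketch
        }
    }

You would probably also want to limit the crawl depth or restrict it to your own domain so the queue does not grow without bound.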

Consider looking at HtmlAgilityPack for the parsing; it might be easier, although a regex that just finds links is simple enough to put together from a quick search.
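If you go the Html Agility Pack route (NuGet package `HtmlAgilityPack`), the link extraction could look roughly like this; again just a sketch against your `getSourceCode` method, not something tested against your pages:

    using HtmlAgilityPack;

    var doc = new HtmlDocument();
    doc.LoadHtml(WorkerClass1.getSourceCode(url));   // parse the downloaded HTML

    // SelectNodes returns null when no matching anchors exist, so guard for that.
    var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
    if (anchors != null)
    {
        foreach (var a in anchors)
        {
            string link = a.GetAttributeValue("href", string.Empty);
            // enqueue or otherwise process the link here
        }
    }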

Upvotes: 1
