guitarPH

Reputation: 201

HTML Agility Pack - Filter Href Value Results

I'm working on a web scraper. The code at the end of this question collects the href value of every link on a page; its current output is shown below.

I only want to keep the values that contain docid=

index.php?pageid=a45475a11ec72b843d74959b60fd7bd64556e8988583f

#

summary_of_documents.php

index.php?pageid=a45475a11ec72b843d74959b60fd7bd64579b861c1d7b

#

index.php?pageid=a45475a11ec72b843d74959b60fd7bd64579e0509c7f0&apform=judiciary

decisions.php?doctype=Decisions / Signed Resolutions&docid=1263778435388003271#sam

decisions.php?doctype=Decisions / Signed Resolutions&docid=12637789021669321156#sam

?doctype=Decisions / Signed Resolutions&year=1986&month=January#head

?doctype=Decisions / Signed Resolutions&year=1986&month=February#head

Here's the code:

        string url = urlTextBox.Text;
        string sourceCode = Extractor.getSourceCode(url);

        HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
        doc.LoadHtml(sourceCode);
        List<string> links = new List<string>();

        // SelectNodes returns null (not an empty collection) when nothing matches,
        // so check its result rather than the freshly created list.
        HtmlAgilityPack.HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//a[@href]");

        if (nodes != null)
        {
            foreach (HtmlAgilityPack.HtmlNode nd in nodes)
            {
                links.Add(nd.Attributes["href"].Value);
            }
        }
        else
        {
            MessageBox.Show("No Links Found");
        }

        // links is never null here; test whether anything was collected.
        if (links.Count > 0)
        {
            foreach (string str in links)
            {
                richTextBox9.Text += str + "\n";
            }
        }
        else
        {
            MessageBox.Show("No Link Values Found");
        }

How can I do this?

Upvotes: 1

Views: 2237

Answers (1)

McGarnagle

Reputation: 102753

Why not just replace this:

links.Add(nd.Attributes["href"].Value);

with this:

if (nd.Attributes["href"].Value.Contains("docid="))
    links.Add(nd.Attributes["href"].Value);
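If the filtering is needed in more than one place, the same check can be pulled into a small helper. The following is only a sketch: the class name HrefFilter and the method ContainingMarker are hypothetical names introduced for illustration, not part of HTML Agility Pack, and the helper assumes the href values have already been collected into a sequence of strings.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class HrefFilter
{
    // Hypothetical helper (not part of HTML Agility Pack):
    // returns only the hrefs that contain the given marker, e.g. "docid=".
    public static List<string> ContainingMarker(IEnumerable<string> hrefs, string marker)
    {
        return hrefs
            // Skip null or empty values so IndexOf is never called on null.
            .Where(h => !string.IsNullOrEmpty(h))
            // Ordinal comparison: exact, case-sensitive, culture-independent match.
            .Where(h => h.IndexOf(marker, StringComparison.Ordinal) >= 0)
            .ToList();
    }
}
```

With the question's code this could be called as HrefFilter.ContainingMarker(links, "docid=") once the loop has filled links. Another option, assuming the XPath is evaluated the usual way by HTML Agility Pack, is to push the filter into the selector itself with "//a[contains(@href, 'docid=')]", so that only matching anchors are returned in the first place.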

Upvotes: 2
