Reputation: 201
I'm working on a web scraper. The code at the end of this question collects the href value of every link on a page; its output is shown below.
I only want the values that contain "docid=".
index.php?pageid=a45475a11ec72b843d74959b60fd7bd64556e8988583f
#
summary_of_documents.php
index.php?pageid=a45475a11ec72b843d74959b60fd7bd64579b861c1d7b
#
index.php?pageid=a45475a11ec72b843d74959b60fd7bd64579e0509c7f0&apform=judiciary
decisions.php?doctype=Decisions / Signed Resolutions&docid=1263778435388003271#sam
decisions.php?doctype=Decisions / Signed Resolutions&docid=12637789021669321156#sam
?doctype=Decisions / Signed Resolutions&year=1986&month=January#head
?doctype=Decisions / Signed Resolutions&year=1986&month=February#head
Here's the code:
string url = urlTextBox.Text;
string sourceCode = Extractor.getSourceCode(url);
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(sourceCode);
List<string> links = new List<string>();
// SelectNodes returns null when no node matches the XPath expression.
HtmlAgilityPack.HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//a[@href]");
if (nodes != null)
{
    foreach (HtmlAgilityPack.HtmlNode nd in nodes)
    {
        links.Add(nd.Attributes["href"].Value);
    }
}
else
{
    MessageBox.Show("No Links Found");
}
if (links.Count > 0)
{
    foreach (string str in links)
    {
        richTextBox9.Text += str + "\n";
    }
}
else
{
    MessageBox.Show("No Link Values Found");
}
How can I do this?
Upvotes: 1
Views: 2237
Reputation: 102753
Why not just replace this:
links.Add(nd.Attributes["href"].Value);
with this:
if (nd.Attributes["href"].Value.Contains("docid="))
    links.Add(nd.Attributes["href"].Value);
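The same check can also be applied to the collected values in one pass with LINQ. This is only a minimal sketch: the `LinkFilter` class and its `ContainingDocId` method are hypothetical names introduced here for illustration, and it assumes the hrefs have already been gathered as strings.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of HtmlAgilityPack or the question's code.
public static class LinkFilter
{
    // Returns only the hrefs that contain the substring "docid=".
    // Null entries are skipped so that Contains is never called on null.
    public static List<string> ContainingDocId(IEnumerable<string> hrefs)
    {
        return hrefs
            .Where(h => h != null && h.Contains("docid="))
            .ToList();
    }
}
```

With the question's code, this could be called as `links = LinkFilter.ContainingDocId(links);` just before the loop that writes to the text box. Note that a plain substring test also matches any parameter whose name merely ends in "docid=", so parsing the query string would be the stricter option if that matters.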
Upvotes: 2