Reputation: 109
I am trying to scrape every href from the page source, taking only the a tags that have class = "linked formlink". I don't understand what I am doing wrong. I am getting null in "links".
StreamReader sr = new StreamReader(webBrowser1.DocumentStream);
string sourceCode = sr.ReadToEnd();
sr.Close();
// removing characters that are illegal in file names and paths
string regexSearch = new string(Path.GetInvalidFileNameChars()) + new string(Path.GetInvalidPathChars());
Regex r = new Regex(string.Format("[{0}]", Regex.Escape(regexSearch)));
sourceCode = r.Replace(sourceCode, "");
HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();
htmlDoc.LoadHtml(sourceCode);
var links = htmlDoc.DocumentNode
    .Descendants("a")
    .Where(x => x.Attributes["class"] != null
             && x.Attributes["class"].Value == "linked formlink")
    .Select(x => x.Attributes["href"].Value);
Upvotes: 0
Views: 88
Reputation: 49
The regular expression is removing the angle brackets, plus other characters (such as quotes and slashes) that HtmlAgilityPack needs in order to recognize tags and classes.
Just remove it.
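As a rough sketch of what that looks like, here is the same query with the regex step dropped. It assumes the HtmlAgilityPack NuGet package is installed. The class name LinkScraper, the method name ExtractFormLinks and the sample markup are made up for illustration. In the original code the string would come from webBrowser1.DocumentStream instead.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // NuGet package, assumed to be installed

public static class LinkScraper
{
    // Hypothetical helper: returns the href of every <a> whose class
    // attribute is exactly "linked formlink".
    public static List<string> ExtractFormLinks(string sourceCode)
    {
        // No regex cleanup here: the markup goes to the parser untouched,
        // so the angle brackets and quotes are still in place.
        var htmlDoc = new HtmlDocument();
        htmlDoc.LoadHtml(sourceCode);

        return htmlDoc.DocumentNode
            .Descendants("a")
            // GetAttributeValue returns the fallback ("") when the attribute is missing.
            .Where(a => a.GetAttributeValue("class", "") == "linked formlink")
            .Select(a => a.GetAttributeValue("href", ""))
            .ToList();
    }

    public static void Main()
    {
        // Made-up sample markup standing in for the browser's document stream.
        string sourceCode =
            "<html><body>" +
            "<a class=\"linked formlink\" href=\"/form/1\">One</a>" +
            "<a class=\"other\" href=\"/skip\">Skip</a>" +
            "<a class=\"linked formlink\" href=\"/form/2\">Two</a>" +
            "</body></html>";

        foreach (string link in ExtractFormLinks(sourceCode))
            Console.WriteLine(link);
    }
}
```

Calling ToList() runs the query immediately, so you can check the result in the debugger straight away instead of looking at a lazy enumerable.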
Upvotes: 2