Piyush Chandra

Reputation: 109

Trying to scrape all hrefs from source code. I don't understand what I am doing wrong

I am trying to scrape every href from the page source, taken from <a> tags that have class="linked formlink". I don't understand what I am doing wrong. I am getting null in "links".

StreamReader sr = new StreamReader(webBrowser1.DocumentStream);
string sourceCode = sr.ReadToEnd();
sr.Close();

// removing illegal path characters
string regexSearch = new string(Path.GetInvalidFileNameChars()) + new string(Path.GetInvalidPathChars());
Regex r = new Regex(string.Format("[{0}]", Regex.Escape(regexSearch)));
sourceCode = r.Replace(sourceCode, "");

HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();
htmlDoc.LoadHtml(sourceCode);

var links = htmlDoc.DocumentNode
    .Descendants("a")
    .Where(x => x.Attributes["class"] != null
             && x.Attributes["class"].Value == "linked formlink")
    .Select(x => x.Attributes["href"].Value.ToString());

Upvotes: 0

Views: 88

Answers (1)

Modar Na

Reputation: 49

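The regular expression is stripping the angle brackets, along with other characters that HtmlAgilityPack needs to recognize tags and attributes. On Windows, Path.GetInvalidFileNameChars() includes <, >, ", / and :, so the markup is destroyed before it is parsed and there are no a elements left to match.

Just remove the regex step.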
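
For illustration, here is a minimal sketch of the same query with the character-stripping step left out. It assumes the HtmlAgilityPack NuGet package is referenced. The sample markup is hypothetical and only stands in for the text read from webBrowser1.DocumentStream. It also uses GetAttributeValue with a default so that an anchor without a class or href does not throw.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        // Hypothetical sample markup standing in for the browser's document source.
        string sourceCode =
            "<html><body>" +
            "<a class=\"linked formlink\" href=\"/form/1\">First</a>" +
            "<a class=\"other\" href=\"/skip\">Skipped</a>" +
            "<a class=\"linked formlink\">No href</a>" +
            "</body></html>";

        // Parse the HTML as-is: no characters are removed beforehand.
        var htmlDoc = new HtmlDocument();
        htmlDoc.LoadHtml(sourceCode);

        var links = htmlDoc.DocumentNode
            .Descendants("a")
            // GetAttributeValue returns the default ("") when the attribute is missing.
            .Where(a => a.GetAttributeValue("class", "") == "linked formlink")
            .Select(a => a.GetAttributeValue("href", ""))
            // Skip matching anchors that have no href at all.
            .Where(href => href.Length > 0)
            .ToList();

        foreach (var href in links)
            Console.WriteLine(href);
    }
}
```

Inside a WinForms project, keep the fully qualified HtmlAgilityPack.HtmlDocument name as in the question, since System.Windows.Forms also defines an HtmlDocument type.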

Upvotes: 2
