c# substring - parse all text in between

Question

I am trying to parse text (mainly the URL) from the HTML code below, but I only want to grab the URL between the div tags (result-firstline-title) and (result-url js-result-url), for every occurrence.

To be clear, I am able to grab all the URLs from the HTML source below, but the problem is that each URL is grabbed almost 3 times. I have a fix for that, which removes duplicate URLs. However, if you look carefully at the HTML source, you will see that it also grabs the 3rd URL, which I don't want.



    
        
The Top Social Networking Sites People Are Using
https://www.lifewire.com/top-social-networking-sites-people-are...
The Top

Top 15 Most Popular Social Networking Sites | January 2019
www.ebizmba.com/articles/social-networking-websites
Top 15 Most

I have tried the following C# code to grab the text between the div tags, but it grabs everything from the first start tag to the last end tag, which I don't want.

        int urlTagFrom = rawHTMLFromSource.IndexOf("result-firstline-title") + "result-firstline-title".Length;
        int urlTagTo = rawHTMLFromSource.LastIndexOf("result-url js-result-url");
        urlTagCollection = rawHTMLFromSource.Substring(urlTagFrom, urlTagTo - urlTagFrom);

To grab the URL I am using the following:

var regexURLParser = new Regex(@"(http|ftp|https):\/\/([\w\-_]+(?:(?:\.[\w\-_]+)+))([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?", RegexOptions.Singleline | RegexOptions.CultureInvariant);

What I want is to grab only the URL from each of these result blocks, so that the outcome shows only:

https://www.lifewire.com/top-social-networking-sites-people-are-using-3486554
http://www.ebizmba.com/articles/social-networking-websites

Umair Anwaar · Accepted Answer

You can make this easier by using HtmlAgilityPack; just add it to your project using NuGet.

To add HtmlAgilityPack using NuGet

Go to the Package Manager Console and type Install-Package HtmlAgilityPack -Version 1.11.3

After the installation, you can extract the URLs as shown below.

using System;
using System.Collections.Generic;
using System.Linq;

var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(@"put html string here");

var listOfUrls = new List<string>();
// SelectNodes returns null when no node matches, hence the null-conditional call.
doc.DocumentNode.SelectNodes("//a")?.ToList()
   .ForEach(x =>
           {
              // Use the HasClass method to keep only the result title links
              if (!string.IsNullOrEmpty(x.GetAttributeValue("href", ""))
                   && x.HasClass("result-title") && x.HasClass("js-result-title"))
              {
                 listOfUrls.Add(x.GetAttributeValue("href", ""));
              }
           });

listOfUrls.ForEach(x => Console.WriteLine(x));

EDIT

Added && x.HasClass("result-title") && x.HasClass("js-result-title") to show only those elements that have both the result-title and js-result-title classes.

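Another way

A shorter way to get the filtered values. Note that this compares the whole class attribute, so it matches only when the value is exactly result-title js-result-title.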

var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(@"put html string here");

var listOfUrls = doc.DocumentNode.Descendants("a")
    .Where(x => x.Attributes["class"] != null 
                && x.Attributes["class"].Value == "result-title js-result-title")
    .Select(x => x.GetAttributeValue("href", "")).ToList();
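If adding a library is not an option, the substring approach from the question can probably be made to work by pairing each start marker with the next end marker, instead of combining IndexOf with LastIndexOf (which spans from the first start marker to the very last end marker). The following is only a minimal sketch under assumptions: the helper name SegmentsBetween, the sample markup, and the simplified URL pattern are hypothetical and not taken from the question; only the two marker strings come from it.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical sample input: only the two class names come from the question,
// the rest of the markup is invented for illustration.
string html =
    "<div class=\"result-firstline-title\"><a href=\"https://example.com/a\">A</a></div>" +
    "<div class=\"result-url js-result-url\">example.com/a</div>" +
    "<div class=\"result-firstline-title\"><a href=\"https://example.com/b\">B</a></div>" +
    "<div class=\"result-url js-result-url\">example.com/b</div>";

// Simplified URL pattern for this sketch; the fuller regex from the question could be used instead.
var urlPattern = new Regex(@"https?://[^\s""'<>]+");

foreach (string segment in SegmentsBetween(html, "result-firstline-title", "result-url js-result-url"))
{
    // Take only the first URL found inside each segment.
    Match match = urlPattern.Match(segment);
    if (match.Success)
    {
        Console.WriteLine(match.Value);
    }
}

// Returns every piece of text found between a start marker and the next end marker after it.
static List<string> SegmentsBetween(string source, string startMarker, string endMarker)
{
    var results = new List<string>();
    int position = 0;
    while (true)
    {
        int start = source.IndexOf(startMarker, position, StringComparison.Ordinal);
        if (start < 0) break;              // no further start marker
        start += startMarker.Length;

        int end = source.IndexOf(endMarker, start, StringComparison.Ordinal);
        if (end < 0) break;                // start marker without a matching end marker

        results.Add(source.Substring(start, end - start));
        position = end + endMarker.Length; // continue searching after this pair
    }
    return results;
}
```

Taking only the first match per segment is one way to avoid the repeated URLs described in the question, but plain string searching stays fragile if the markup changes, so the HtmlAgilityPack versions above are likely the more robust choice.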
