Buddhihin
Buddhihin

Reputation: 39

c# substring - parse all text in between

trying to parse all text (mainly the url) from the html code below. but i would only like to grab the url between these div tags (result-firstline-title) and (result-url js-result-url) for each(all) occurrences.

to be clear, i am able to grab all the url from the html source below, but the problem is it is also grabbing the url almost 3 times. and for that i have a fix which to remove duplicate urls, however, if you look carefully to the html source, you will see that it also grabs the 3rd url.

<div class="result js-result card-mobile ">
<div class="result-firstline-container">
    <div class="result-firstline-title">
        <a
            class="result-title js-result-title"

            href="https://www.lifewire.com/top-social-networking-sites-people-are-using-3486554"

        >
            The Top Social Networking Sites People Are Using
        </a>
    </div>

</div>

<a
    class="result-url js-result-url"

    href="https://www.lifewire.com/top-social-networking-sites-people-are-using-3486554">https://www.lifewire.com/top-<b>social-networking-sites</b>-people-are...
</a>
<p class="result-snippet">
    The Top
</p>
</div>

<div class="result js-result card-mobile ">
    <div class="result-firstline-container">
        <div class="result-firstline-title">
            <a
                class="result-title js-result-title"

                href="http://www.ebizmba.com/articles/social-networking- websites"

            >
                Top 15 Most Popular Social Networking Sites | January 2019
            </a>
        </div>

    </div>

    <a
        class="result-url js-result-url"

        href="http://www.ebizmba.com/articles/social-networking- websites">www.ebizmba.com/articles/<b>social-networking</b>-<b>websites</b>
    </a>
    <p class="result-snippet">
        Top 15 Most 
    </p>

</div>     

i have tried the following c# code to grab the text between the div tags but it grabs everything, which i dont want.

        int urlTagFrom = rawHTMLFromSource.IndexOf("result-firstline-title") + "result-firstline-title".Length;
        int urlTagTo = rawHTMLFromSource.LastIndexOf("result-url js-result-url");
        urlTagCollection = rawHTMLFromSource.Substring(urlTagFrom, urlTagTo - urlTagFrom);

to grab url i am using the following:

var regexURLParser = new Regex(@"(http|ftp|https):\/\/([\w\-_]+(?:(?:\.[\w\-_]+)+))([\w\-\.,@?^=%&amp;:/~\+#]*[\w\-\@?^=%&amp;/~\+#])?", RegexOptions.Singleline | RegexOptions.CultureInvariant);

what i want is to grab is the url from these:

        <a
            class="result-title js-result-title"

            href="https://www.lifewire.com/top-social-networking-sites-people-are-using-3486554"

        >

        <a
            class="result-title js-result-title"

            href="http://www.ebizmba.com/articles/social-networking-websites"

        >

so that the outcome shows only:

https://www.lifewire.com/top-social-networking-sites-people-are-using-3486554
http://www.ebizmba.com/articles/social-networking-websites 

Upvotes: 1

Views: 123

Answers (1)

Umair Anwaar
Umair Anwaar

Reputation: 1138

You can make it more easier by using HTMLAgilityPack just include it in your project using NuGet.

To add HTMLAgilityPack using NuGet

go to the Package Manager Console and type Install-Package HtmlAgilityPack -Version 1.11.3

after the installation you can extract Urls like below.

var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(@"put html string here");

var listOfUrls = new List<string>();
doc.DocumentNode.SelectNodes("//a").ToList()
   .ForEach(x=> 
           {
              //Use HasClass method to filter elements 
              if (!string.IsNullOrEmpty(x.GetAttributeValue("href", "")) 
                   && x.HasClass("result-title") && x.HasClass("js-result-title"))
              {
                 listOfUrls.Add(x.GetAttributeValue("href", ""));
              }
           });

listOfUrls.ForEach(x => Console.WriteLine(x));

EDIT

Added && x.HasClass("result-title") && x.HasClass("js-result-title") to shows only those elements which has the class result-title and js-result-title.

Another way

shorter and another way to get filtered values.

var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(@"put html string here");

var listOfUrls = doc.DocumentNode.Descendants("a")
    .Where(x => x.Attributes["class"] != null 
                && x.Attributes["class"].Value == "result-title js-result-title")
    .Select(x => x.GetAttributeValue("href", "")).ToList();

Upvotes: 2

Related Questions