Baks

Reputation: 37

Getting a list which contains specific text using regex

I am trying to get every "ul" element that contains the term "[My search Text]".

I have tried the regex below, but it does not return the expected output:

<ul[^>]*>\s*?\w+?(.|\n).*(\[My search Text\]).*(.|\n).+</ul>

Input :

<ul><li>[My search Text] is required  </li></ul>
<ul><li>[My edit Text] is not required </li></ul>
<ul><li><b>[My search Text] is mandatory </b> </li>    </ul>
<ul><li><strong>[My search Text] is so mandatory </strong> </li></ul>

Desired Output :

<ul><li>[My search Text] is required  </li></ul>  
<ul><li><b>[My search Text] is mandatory </b> </li>    </ul>
<ul><li><strong>[My search Text] is so mandatory </strong> </li></ul>

Thanks in advance

Upvotes: 1

Views: 226

Answers (2)

amit dayama

Reputation: 3326

Try this (for text inside ul):

<ul>.*(\[My search Text\]).+</ul>

For text inside li:

<ul>.*<li>.*(\[My search Text\]).+</li>.*</ul>

Upvotes: 0

Wiktor Stribiżew

Reputation: 626802

A note on your regex:

  • <ul[^>]*> - should work OK,
  • \s*? - no need to use a lazy quantifier here,
  • \w+? - same, no need for lazy matching; it also requires a word character right after the opening tag, but your input has <li> there, so the match fails at this point,
  • (.|\n) - this makes little sense, since it matches any single character (including a newline) only once,
  • .* - 0 or more characters other than a newline, as many as possible,
  • (\[My search Text\]) - a literal [My search Text],
  • .*(.|\n) - same as above,
  • .+ - 1 or more characters other than a newline,
  • </ul> - a literal </ul>.

As you can see, this regex has no real multiline support: . does not match a newline, and each (.|\n) group consumes only one character. It is also inefficient, because the several .* and .+ subpatterns require a lot of backtracking.
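If a regex is still wanted, one possible sketch is shown below. It is only an illustration, not a general HTML parser: it assumes the lists are flat (no nested <ul> elements) and that tag names are lowercase as in the sample input. The class name UlRegexSketch and the method FindLists are hypothetical names made up for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class UlRegexSketch
{
    // Hypothetical helper: returns every flat <ul>...</ul> block whose
    // content contains the literal search text.
    public static List<string> FindLists(string html, string searchText)
    {
        // (?:(?!</ul>).)* consumes one character at a time but never steps
        // over a closing </ul>, so a match cannot spill into the next list.
        string pattern = @"<ul\b[^>]*>(?:(?!</ul>).)*?"
                       + Regex.Escape(searchText)
                       + @"(?:(?!</ul>).)*</ul>";

        var results = new List<string>();
        // Singleline lets '.' match newlines, so a list may span several lines.
        foreach (Match m in Regex.Matches(html, pattern, RegexOptions.Singleline))
            results.Add(m.Value);
        return results;
    }

    public static void Main()
    {
        string input = string.Join("\n",
            "<ul><li>[My search Text] is required  </li></ul>",
            "<ul><li>[My edit Text] is not required </li></ul>",
            "<ul><li><b>[My search Text] is mandatory </b> </li>    </ul>",
            "<ul><li><strong>[My search Text] is so mandatory </strong> </li></ul>");

        foreach (string ul in FindLists(input, "[My search Text]"))
            Console.WriteLine(ul);
    }
}
```

Regex.Escape takes care of the square brackets in the search text, so it can be passed in unescaped. For nested lists or otherwise irregular markup, an HTML parser is the safer choice.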

I would install the HtmlAgilityPack and use the following method:

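// Requires the HtmlAgilityPack NuGet package, plus
// using System; and using System.Collections.Generic;
public List<string> HtmlAgilityPackGetTagOuterHTMLbyXpath(string html, string xpath)
{
    HtmlAgilityPack.HtmlDocument hap;
    var results = new List<string>();
    Uri uriResult;
    // Treat the argument as a URL only if it is an absolute http/https URI
    if (Uri.TryCreate(html, UriKind.Absolute, out uriResult)
        && (uriResult.Scheme == Uri.UriSchemeHttp || uriResult.Scheme == Uri.UriSchemeHttps))
    { // html is a URL: download and parse the page
        var doc = new HtmlAgilityPack.HtmlWeb();
        hap = doc.Load(uriResult.AbsoluteUri);
    }
    else
    { // html is a string: parse it directly
        hap = new HtmlAgilityPack.HtmlDocument();
        hap.LoadHtml(html);
    }
    // SelectNodes returns null (not an empty collection) when nothing matches
    var nodes = hap.DocumentNode.SelectNodes(xpath);
    if (nodes != null)
    {
        foreach (var node in nodes)
            results.Add(node.OuterHtml);
    }
    return results;
}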

Either of these two XPath expressions should return the 3 matching <ul> nodes:

//li[contains(., 'My search Text')]/ancestor::ul[1]
//ul[.//li[contains(., 'My search Text')]]

Like this:

var res = HtmlAgilityPackGetTagOuterHTMLbyXpath(s, "//li[contains(., 'My search Text')]/ancestor::ul[1]");
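If you prefer to avoid XPath, roughly the same selection can probably be written with LINQ over the parsed nodes. This is only a sketch, assuming the HtmlAgilityPack package is referenced; the class name UlLinqSketch and the method FindLists are hypothetical names used for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class UlLinqSketch
{
    // Hypothetical helper: outer HTML of every <ul> that has at least one
    // <li> descendant whose text contains the search text.
    public static List<string> FindLists(string html, string searchText)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        return doc.DocumentNode
                  .Descendants("ul")
                  .Where(ul => ul.Descendants("li")
                                 .Any(li => li.InnerText.Contains(searchText)))
                  .Select(ul => ul.OuterHtml)
                  .ToList();
    }

    public static void Main()
    {
        string s = string.Join("\n",
            "<ul><li>[My search Text] is required  </li></ul>",
            "<ul><li>[My edit Text] is not required </li></ul>",
            "<ul><li><b>[My search Text] is mandatory </b> </li>    </ul>",
            "<ul><li><strong>[My search Text] is so mandatory </strong> </li></ul>");

        foreach (string ul in FindLists(s, "[My search Text]"))
            Console.WriteLine(ul);
    }
}
```

Like the second XPath above, this also returns an outer <ul> when the matching <li> sits in a nested list.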


Upvotes: 1
