Waseem Fastian

Reputation: 33

Regex for Removing <a> tag text that is between <ul> and <li> C#

I have the following HTML. I have tried many regexes to remove the hyperlink content/text that sits between ul and li tags only, but have not found one that removes the a tag text. Whenever an a tag appears inside a ul and li tag, I want to replace the a tag text with an empty string.

<ul id="foot.dir" class="content" >
 <li><a href="http://www.citysearch.com/aboutcitysearch/about_us"  name="search_grid.footer.1.aboutCs" rel="nofollow" id="foot.dir.about">About</a></li>
 <li><a href="http://www.citysearch.com/mobile-application" name="search_grid.footer.1.mobile" id="foot.dir.apps">Apps</a></li>
</ul>

I have tried this regex, but it is not working. Here input is a string that contains the HTML.

input = Regex.Replace(input, @"<ul[^>]*?><li><a[^>]*?>(?<option>.*?)</ul></li></a>", string.Empty);

Please help me out. Thank You

Upvotes: 1

Views: 1824

Answers (2)

Anirudha

Reputation: 32807

Regex is not a good choice for parsing HTML files.

HTML is neither strict nor regular in its format.

Use HtmlAgilityPack instead.

Regex is meant for matching regular patterns, which HTML is not.

You can use this code to remove the anchor tags with HtmlAgilityPack:

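HtmlDocument doc = new HtmlDocument();
doc.Load(yourStream);

// SelectNodes returns null when nothing matches, so guard before iterating
var anchors = doc.DocumentNode.SelectNodes("//ul/li/a"); // anchors directly inside an li of a ul
if (anchors != null)
{
    foreach (var item in anchors)
    {
        item.ParentNode.RemoveChild(item); // removes the anchor tag, leaving the li in place
    }
}
// don't forget to save the document afterwards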

"I want to remove the tag text using regex only."

Regex.Replace(input,@"(?<=<li[^>]*>)\s*<a.*?(?=</li>)","",RegexOptions.Singleline);
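
To see what that pattern does, here is a minimal sketch that runs it on a shortened version of the markup from the question. The class name AnchorStripDemo and the helper StripAnchors are hypothetical names used only for illustration; the pattern itself is the one above, unchanged. It assumes every anchor sits directly after its opening li tag, as in the sample.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AnchorStripDemo
{
    // Hypothetical helper wrapping the pattern from the answer above.
    // The lookbehind requires an opening <li ...> right before the match,
    // and the lazy .*? stops just before the next </li>.
    public static string StripAnchors(string input)
    {
        return Regex.Replace(
            input,
            @"(?<=<li[^>]*>)\s*<a.*?(?=</li>)",
            "",
            RegexOptions.Singleline);
    }

    public static void Main()
    {
        // Shortened sample based on the HTML in the question.
        string html =
            "<ul id=\"foot.dir\" class=\"content\" >\n" +
            " <li><a href=\"http://www.citysearch.com/mobile-application\" id=\"foot.dir.apps\">Apps</a></li>\n" +
            "</ul>";

        // Each <li> is kept, but the anchor inside it is removed.
        Console.WriteLine(StripAnchors(html));
    }
}
```

Note that this removes the whole anchor element, not only its text, and that it will stop at the first </li> it meets, so nested lists or anchors that are not the first child of the li are not handled. That fragility is the reason an HTML parser is the safer route.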

Upvotes: 1

Oded

Reputation: 499062

Regex is a poor choice for parsing HTML, in particular HTML that is not consistent.

I suggest using the HTML Agility Pack to parse and change the HTML.

What is exactly the Html Agility Pack (HAP)?

This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what proposes System.Xml, but for HTML documents (or streams).

The source download comes with a number of sample projects showing how to use the library.

Upvotes: 2
