tweedledum11

Reputation: 121

Remove redundant html tags from string

I have various string objects containing HTML-formatted text. Some of these strings end with certain tags that I want to remove programmatically, like the line break and paragraph tags at the end of this one:

<li><ol>  **Text/List**  </li></ol><p><br></p><br><br>

I need to check the string from its end, but I can't figure out where to cut it off, or how to find the cutting point. I just need to get rid of these redundant tags.

I tried to build a function that checks the string. I know it doesn't work properly, but it's my starting point:

public static String RemoveRedundantTags(this String baseString, String html)
{
    if (html.Contains("<"))
    {
        for (Int32 i = html.Length - 1; i >= 1; i--)
        {
            if (html[i] == '<' && html[i - 1] != '>' && html[i + 1] != '/')
            {
                redundantTags = html.Substring(html[i], html.Length - i);

                html = html.Replace(redundantTags, String.Empty);

                return html;
            }
        }
    }

    return html;
}

Upvotes: 1

Views: 448

Answers (1)

Tim Schmelter

Reputation: 460038

If I needed to manipulate HTML, I'd use an HTML parser such as HtmlAgilityPack rather than string methods or regex. Here is an example that removes all br tags from the end:

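using System.IO;
using System.Linq;
using HtmlAgilityPack;

string html = "<li><ol>  **Text/List**  </li></ol><p><br></p><br><br>";
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);

// Walk all nodes in reverse document order and take the trailing run of br nodes.
// Reverse() buffers the whole sequence, so removing nodes inside the loop is safe.
var brToRemove = doc.DocumentNode.Descendants().Reverse().TakeWhile(n => n.Name == "br");
foreach (HtmlNode node in brToRemove)
    node.Remove();

// Serialize the modified document back to a string.
using (StringWriter writer = new StringWriter())
{
    doc.Save(writer);
    string result = writer.ToString();
}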

The result is:

<li><ol>  **Text/List**  </ol></li><p>

As you can see, by default it fixes parse errors on its own. There was one:

Start tag <ol> was not found
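
If you want to see such messages yourself, HtmlAgilityPack collects them in the document's ParseErrors property after loading. The following is only a minimal sketch: the class name ParseErrorDemo and the method name GetParseErrorReasons are made up for illustration, and the exact messages you get may depend on the library version.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper class, not part of HtmlAgilityPack.
public static class ParseErrorDemo
{
    // Loads the markup and returns the Reason text of every parse error
    // that HtmlAgilityPack recorded while parsing it.
    public static List<string> GetParseErrorReasons(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        return doc.ParseErrors.Select(e => e.Reason).ToList();
    }
}
```

Calling ParseErrorDemo.GetParseErrorReasons(html) with the string from the question should return a list containing the message for the misnested tags.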


If the HTML was

html = "<ol><li>TEXT</li></ol><p><br></p><p><br></p>&nbsp;";

and you wanted to remove all trailing <p> and <br> tags as well as the trailing &nbsp;, as mentioned in the comments, you could use the following approach. It uses a dictionary where the key is the tag name and the value is a sequence of inner-text strings that acts as a sub-selector for that tag. If the value is an empty sequence, the tag is removed no matter what inner text it has. Here is a dictionary for your requirement:

var tagsToRemove = new Dictionary<string, IEnumerable<string>>
{
    { "br", Enumerable.Empty<string>() },
    { "p", Enumerable.Empty<string>() },
    { "#text", new[] { "&nbsp;" } }
};

Now the LINQ query to find all tags to remove is:

var brToRemove = doc.DocumentNode.Descendants()
    .Reverse()
    .TakeWhile(n => tagsToRemove.ContainsKey(n.Name) 
                 && tagsToRemove[n.Name].DefaultIfEmpty(n.InnerText).Contains(n.InnerText));

The (desired) result is:

<ol><li>TEXT</li></ol>
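
For completeness, the dictionary, the query, and the removal loop can be combined into one method. This is just a sketch under a few assumptions: the class name TrailingNodeTrimmer and the method name TrimTrailingNodes are invented for illustration, and the query result is copied into a list before anything is removed, which the snippets above do not do explicitly.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper class, not part of HtmlAgilityPack.
public static class TrailingNodeTrimmer
{
    // Key: node name. Value: inner texts that qualify the node for removal.
    // An empty sequence means "remove regardless of inner text".
    private static readonly Dictionary<string, IEnumerable<string>> TagsToRemove =
        new Dictionary<string, IEnumerable<string>>
        {
            { "br", Enumerable.Empty<string>() },
            { "p", Enumerable.Empty<string>() },
            { "#text", new[] { "&nbsp;" } }
        };

    public static string TrimTrailingNodes(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Collect the trailing run of removable nodes, starting from the end
        // of the document, and copy it to a list before removing anything.
        List<HtmlNode> nodesToRemove = doc.DocumentNode.Descendants()
            .Reverse()
            .TakeWhile(n => TagsToRemove.ContainsKey(n.Name)
                         && TagsToRemove[n.Name].DefaultIfEmpty(n.InnerText).Contains(n.InnerText))
            .ToList();

        foreach (HtmlNode node in nodesToRemove)
            node.Remove();

        // Serialize the modified document back to a string.
        using (var writer = new StringWriter())
        {
            doc.Save(writer);
            return writer.ToString();
        }
    }
}
```

You would call it as TrailingNodeTrimmer.TrimTrailingNodes(html). To remove other trailing tags, add entries to the dictionary rather than changing the query.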

Upvotes: 2
