anon271334

Removing HTML Comments

How would one go about removing comments from HTML files?

A comment may take up only a single line, but I'm sure I'll run into cases where it spans multiple lines:

<!-- Single line comment. -->

<!-- Multi-
ple line comment.
Lots      '""' '  "  ` ~ |}{556             of      !@#$%^&*())        lines
in
this
comme-
nt! -->

Upvotes: 9

Views: 7950

Answers (4)

aloisdg

Reputation: 23521

Here is one more solution. This one works without a library or regex, and you can easily change the starting and ending delimiters.

public static string RemoveHtmlComments(string input)
{
    var start = "<!--";
    var end = "-->";
    return RemoveBetweenAny(input, start, end);
}

private static string RemoveBetweenAny(string source, string start, string end)
{
    string RemoveBetweenRec(string source, string start, string end, string result)
    {
        var startIndex = source.IndexOf(start);
        if (startIndex < 0)
            return result + source;
        var head = source.Remove(startIndex);
        // Look for the end delimiter only after the start delimiter, so a
        // stray "-->" earlier in the text is not mistaken for the closing one.
        var endIndex = source.IndexOf(end, startIndex + start.Length);
        if (endIndex < 0)
            return result + head;
        var tail = source.Remove(0, endIndex + end.Length);
        return RemoveBetweenRec(tail, start, end, result + head);
    }
    
    return RemoveBetweenRec(source, start, end, "");
}

Some unit tests with xUnit:

[Theory]
[InlineData("", "")]
[InlineData("Hello World", "Hello World")]
[InlineData("He<!--llo -->World", "HeWorld")]
[InlineData(@"He<!--llo World", @"He")]
[InlineData("He<!--llo -->WorldHe<!--llo -->World", "HeWorldHeWorld")]
[InlineData("He<!--llo -->WorldHe<!--llo World", "HeWorldHe")]
public void TestRemoveComments(string input, string expected)
{
    Assert.Equal(expected, RemoveHtmlComments(input));
}
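To illustrate the point about swapping delimiters, here is a minimal, self-contained sketch that applies the same head/tail idea to C-style block comments. The names `Stripper` and `RemoveBetween` are hypothetical, and the loop is an iterative restatement of the approach rather than the exact code above.

```csharp
using System;
using System.Text;

public static class Stripper
{
    // Removes every span that begins with `start` and ends with `end`.
    // An unterminated span is dropped through to the end of the input.
    public static string RemoveBetween(string source, string start, string end)
    {
        var result = new StringBuilder();
        int pos = 0;
        while (true)
        {
            int startIndex = source.IndexOf(start, pos, StringComparison.Ordinal);
            if (startIndex < 0)
                return result.Append(source, pos, source.Length - pos).ToString();

            // Keep the text in front of the start delimiter.
            result.Append(source, pos, startIndex - pos);

            int endIndex = source.IndexOf(end, startIndex + start.Length, StringComparison.Ordinal);
            if (endIndex < 0)
                return result.ToString();

            pos = endIndex + end.Length;
        }
    }

    public static void Main()
    {
        // Same routine, different delimiters.
        Console.WriteLine(RemoveBetween("He<!--llo -->World", "<!--", "-->")); // HeWorld
        Console.WriteLine(RemoveBetween("int x;/* temp */int y;", "/*", "*/")); // int x;int y;
    }
}
```

Only the two delimiter strings change between the calls; the scanning logic stays the same.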

Upvotes: 0

zellio

Reputation: 32484

Not the best solution out there, but a simple one-pass algorithm should do the trick:

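List<string> output = new List<string>();

bool inComment = false;   // true while the scan is inside <!-- ... -->
foreach ( string line in System.IO.File.ReadAllLines( "MyFile.html" )) {

    var kept = new System.Text.StringBuilder();
    int pos = 0;
    while ( pos < line.Length ) {
        if ( inComment ) {
            int close = line.IndexOf( "-->", pos );
            if ( close < 0 ) break;       // comment continues on the next line
            pos = close + 3;
            inComment = false;
        } else {
            int open = line.IndexOf( "<!--", pos );
            if ( open < 0 ) {             // no more comments on this line
                kept.Append( line, pos, line.Length - pos );
                break;
            }
            kept.Append( line, pos, open - pos );
            pos = open + 4;
            inComment = true;
        }
    }

    // Skip lines that lie entirely inside a comment.
    if ( kept.Length > 0 || !inComment ) {
        output.Add( kept.ToString() );
    }
}

System.IO.File.WriteAllLines( "MyOutput.html", output );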

Upvotes: 3

Ankush Roy

Reputation: 1631

This function, with minor tweaks, should work:

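private string RemoveHTMLComments(string input)
{
    var output = new System.Text.StringBuilder();
    string[] temp = System.Text.RegularExpressions.Regex.Split(input, "<!--");

    // The first piece precedes any "<!--", so it is never inside a comment.
    output.Append(temp[0]);

    for (int i = 1; i < temp.Length; i++)
    {
        // Every later piece starts inside a comment; keep only what follows "-->".
        int end = temp[i].IndexOf("-->");
        if (end >= 0)
        {
            output.Append(temp[i].Substring(end + 3));
        }
        // No "-->" means the comment is unterminated, so the piece is dropped.
    }
    return output.ToString();
}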

Not sure if it's the best solution...
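For comparison, the split-on-delimiters idea can also be sketched as a single `Regex.Replace` call. This is only a minimal sketch: it assumes every comment is closed with `-->`, it makes no attempt to treat `<script>` contents or conditional comments specially, and the `CommentStripper` and `Strip` names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CommentStripper
{
    // RegexOptions.Singleline lets '.' match newlines, so multi-line comments are covered.
    // The lazy quantifier '.*?' stops at the first "-->" rather than the last one.
    private static readonly Regex Comment =
        new Regex("<!--.*?-->", RegexOptions.Singleline);

    public static string Strip(string html)
    {
        return Comment.Replace(html, string.Empty);
    }

    public static void Main()
    {
        Console.WriteLine(Strip("a<!-- one -->b<!-- two\nlines -->c")); // abc
    }
}
```

Unlike the loop above, an unterminated `<!--` is left in place here, because the pattern only matches when a closing `-->` exists.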

Upvotes: 4

Simon Mourier

Reputation: 138950

You could use the Html Agility Pack .NET library. Here is an article that explains how to use it on SO: How to use HTML Agility pack

This is the C# code to remove comments:

    HtmlDocument doc = new HtmlDocument();
    doc.Load("yourFile.htm");

    // get all comment nodes using XPATH
    // (SelectNodes returns null, not an empty collection, when nothing matches)
    var comments = doc.DocumentNode.SelectNodes("//comment()");
    if (comments != null)
    {
        foreach (HtmlNode comment in comments)
        {
            comment.ParentNode.RemoveChild(comment);
        }
    }
    doc.Save(Console.Out); // displays doc w/o comments on console

Upvotes: 14
