Reputation:
How would one go about removing comments from HTML files?
A comment may take up only a single line, but I'm sure I'll run into cases where a comment spans multiple lines:
<!-- Single line comment. -->
<!-- Multi-
ple line comment.
Lots '""' ' " ` ~ |}{556 of !@#$%^&*()) lines
in
this
comme-
nt! -->
Upvotes: 9
Views: 7950
Reputation: 23521
Here is one more solution. This one works without a library or regex, and you can easily change the starting and ending delimiters.
// requires: using System;
public static string RemoveHtmlComments(string input)
{
    var start = "<!--";
    var end = "-->";
    return RemoveBetweenAny(input, start, end);
}

private static string RemoveBetweenAny(string source, string start, string end)
{
    // Local helper; it captures start and end, and its own parameters
    // are named so they do not shadow the outer ones.
    string RemoveBetweenRec(string rest, string result)
    {
        var startIndex = rest.IndexOf(start, StringComparison.Ordinal);
        if (startIndex < 0)
            return result + rest;
        var head = rest.Remove(startIndex);
        // Search for the end delimiter only after the start delimiter,
        // so a stray "-->" earlier in the text is not mistaken for the close.
        var endIndex = rest.IndexOf(end, startIndex + start.Length, StringComparison.Ordinal);
        if (endIndex < 0)
            return result + head; // unterminated comment: drop the rest
        var tail = rest.Remove(0, endIndex + end.Length);
        return RemoveBetweenRec(tail, result + head);
    }
    return RemoveBetweenRec(source, "");
}
Some unit tests with xUnit:
[Theory]
[InlineData("", "")]
[InlineData("Hello World", "Hello World")]
[InlineData("He<!--llo -->World", "HeWorld")]
[InlineData(@"He<!--llo World", @"He")]
[InlineData("He<!--llo -->WorldHe<!--llo -->World", "HeWorldHeWorld")]
[InlineData("He<!--llo -->WorldHe<!--llo World", "HeWorldHe")]
public void TestRemoveComments(string input, string expected)
{
    Assert.Equal(expected, RemoveHtmlComments(input));
}
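To illustrate the point about swapping delimiters, here is a minimal sketch of the same idea written as a loop instead of recursion, which avoids deep call stacks on input with very many comments. The class name `DelimiterStripper` and the method name `RemoveBetween` are made up for this example and are not part of the answer above. It assumes both delimiters are non-empty and, like the tests above, drops everything after an unterminated start delimiter.

```csharp
using System;
using System.Text;

public static class DelimiterStripper // hypothetical name, for illustration only
{
    // Removes every span that begins with `start` and runs through the next `end`.
    // If a `start` has no matching `end`, the rest of the input is dropped.
    public static string RemoveBetween(string source, string start, string end)
    {
        if (string.IsNullOrEmpty(start) || string.IsNullOrEmpty(end))
            throw new ArgumentException("Delimiters must be non-empty.");

        var result = new StringBuilder();
        int pos = 0;
        while (true)
        {
            int open = source.IndexOf(start, pos, StringComparison.Ordinal);
            if (open < 0)
            {
                // No further start delimiter: keep the remainder unchanged.
                result.Append(source, pos, source.Length - pos);
                break;
            }
            // Keep the text that precedes the start delimiter.
            result.Append(source, pos, open - pos);
            int close = source.IndexOf(end, open + start.Length, StringComparison.Ordinal);
            if (close < 0)
                break; // unterminated span: drop the rest
            pos = close + end.Length;
        }
        return result.ToString();
    }
}
```

For example, `DelimiterStripper.RemoveBetween(html, "<!--", "-->")` strips HTML comments, while `DelimiterStripper.RemoveBetween(css, "/*", "*/")` would strip block comments from a stylesheet with the same code.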
Upvotes: 0
Reputation: 32484
Not the best solution out there, but a simple one-pass algorithm should do the trick:
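List<string> output = new List<string>();
bool inComment = false; // true while inside a comment that spans lines
foreach ( string line in System.IO.File.ReadAllLines( "MyFile.html" )) {
    var kept = new System.Text.StringBuilder();
    bool sawComment = inComment; // did a comment touch this line?
    int pos = 0;
    while ( pos < line.Length ) {
        if ( inComment ) {
            int close = line.IndexOf( "-->", pos, System.StringComparison.Ordinal );
            if ( close < 0 ) break; // comment continues on the next line
            pos = close + 3;
            inComment = false;
        } else {
            int open = line.IndexOf( "<!--", pos, System.StringComparison.Ordinal );
            if ( open < 0 ) { kept.Append( line, pos, line.Length - pos ); break; }
            kept.Append( line, pos, open - pos ); // text before the comment
            pos = open + 4;
            inComment = true;
            sawComment = true;
        }
    }
    // Skip lines that consisted only of comment text; keep everything else,
    // including lines that were blank to begin with.
    if ( kept.Length > 0 || !sawComment ) {
        output.Add( kept.ToString() );
    }
}
System.IO.File.WriteAllLines( "MyOutput.html", output );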
Upvotes: 3
Reputation: 1631
This function, with minor tweaks, should work:
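private string RemoveHTMLComments(string input)
{
    var output = new System.Text.StringBuilder();
    // Every piece after the first one starts just inside a comment.
    string[] pieces = System.Text.RegularExpressions.Regex.Split(input, "<!--");
    // Text before the first comment is kept as-is.
    output.Append(pieces[0]);
    for (int i = 1; i < pieces.Length; i++)
    {
        int close = pieces[i].IndexOf("-->", System.StringComparison.Ordinal);
        // Without a closing "-->" the comment is unterminated, so the piece is dropped.
        if (close >= 0)
        {
            // Keep what follows the comment, whitespace included.
            output.Append(pieces[i].Substring(close + 3));
        }
    }
    return output.ToString();
}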
Not sure if it's the best solution...
Upvotes: 4
Reputation: 138950
You could use the Html Agility Pack .NET library. Here is an article that explains how to use it on SO: How to use HTML Agility pack
This is the C# code to remove comments:
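HtmlDocument doc = new HtmlDocument();
doc.Load("yourFile.htm");
// get all comment nodes using XPATH
// (SelectNodes returns null, not an empty collection, when nothing matches)
HtmlNodeCollection comments = doc.DocumentNode.SelectNodes("//comment()");
if (comments != null)
{
    foreach (HtmlNode comment in comments)
    {
        comment.ParentNode.RemoveChild(comment);
    }
}
doc.Save(Console.Out); // displays doc w/o comments on console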
Upvotes: 14