Reputation: 33
I want to remove the comments from this kind of script:
var stName = "MyName"; //I WANT THIS COMMENT TO BE REMOVED
var stLink = "http://domain.com/mydomain";
var stCountry = "United State of America";
What is the best way to accomplish this using PHP?
Upvotes: 0
Views: 659
Reputation: 991
I would go with preg_replace(). Assuming all comments are single-line comments (// Comment here), you can start with this:
$JsCode = 'var stName = "MyName isn\'t \"Foobar\""; //I WANT THIS COMMENT TO BE REMOVED
var stLink = "http://domain.com/mydomain"; // Comment
var stLink2 = \'http://domain.com/mydomain\'; // This comment goes as well
var stCountry = "United State of America"; // Comment here';
$RegEx = '/(["\']((?>[^"\']+)|(?R))*?(?<!\\\\)["\'])(.*?)\/\/.*$/m';
echo preg_replace($RegEx, '$1$3', $JsCode);
Output:
var stName = "MyName isn't \"Foobar\"";
var stLink = "http://domain.com/mydomain";
var stLink2 = 'http://domain.com/mydomain';
var stCountry = "United State of America";
This solution is far from perfect and might have issues with strings containing "//" in them.
Upvotes: 0
Reputation: 23850
The best way is to use an actual parser, or at least to write a lexer yourself.
The problem with regex is that it becomes enormously complex once you account for every case you have to handle.
For example, Cagatay Ulubay's suggested regexes /\/\/[^\n]?/ and /\/\*(.*)\*\// will match comments, but they will also match a lot more, such as:
var a = '/* the contents of this string will be matched */';
var b = '// and here you will even get a syntax error, because the entire rest of the line is removed';
var c = 'and actually, the regex that matches multiline comments will span across lines, removing everything between the first "/*" and here: */';
/*
this comment, however, will not be matched.
*/
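The first case is easy to reproduce. Here is a small sketch in JavaScript that applies the block-comment regex quoted above to one line of source held in a string; the variable names are only for illustration:

```javascript
// The block-comment regex quoted above, written as a JavaScript regex literal.
const blockComment = /\/\*(.*)\*\//;

// One line of script source. The "/* ... */" is inside a string literal,
// so it is not a comment at all.
const source = "var a = '/* the contents of this string will be matched */';";

// The regex cannot tell a string from code, so it deletes the string contents.
const stripped = source.replace(blockComment, "");

console.log(stripped); // var a = '';
```

The result is still valid code, but the script now does something different from the original.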
While it is rather unlikely that strings contain such sequences, the problem is real with inline regex:
var regex = /^something.*/; // You see the fake "*/" here?
The current scope matters a lot, and you can't possibly know the current scope unless you scan the script from the beginning, character by character.
So you essentially need to build a lexer.
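You need to split the code into three different kinds of sections: normal code, comments and literals. Normal code and literals are copied to the output and comments are dropped; a comment can only start inside normal code.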
Now the only literals I can think of are strings (single- and double-quoted), inline regex and template strings (backticks), but those might not be all.
And of course you also have to take escape sequences inside those literals into account, because you might encounter an inline regex like
/^file:\/\/\/*.+/
in which a single-character based lexer would only see the regex /^file:\/ and incorrectly parse the following /*.+ as the start of a multiline comment.
Therefore, upon encountering the second /, you have to look back and check if the last character you passed was a \. The same goes for all kinds of quotes for strings.
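To make this concrete, here is a minimal sketch of such a scanner, written in JavaScript; the same loop can be written in PHP with plain string indexing. The names stripComments and regexCanStartAfter are made up for this example. Instead of looking back for a \, it skips the character after every backslash while inside a literal, which also copes with an escaped backslash right before the closing delimiter. The rule for deciding whether a / starts a regex or is a division is a rough guess of mine, not a complete one. The sketch also ignores a / inside a regex character class and ${...} expressions inside template strings, so treat it as a starting point only.

```javascript
// Guess whether a "/" starts a regex literal, based on the last
// non-whitespace character seen in normal code. Rough heuristic only.
function regexCanStartAfter(prev) {
  return prev === "" || "(,=:[!&|?{};".includes(prev);
}

// Remove // and /* */ comments while copying code and literals unchanged.
function stripComments(src) {
  let out = "";
  let i = 0;
  let prev = ""; // last non-whitespace character copied from normal code

  while (i < src.length) {
    const ch = src[i];
    const next = src[i + 1];

    if (ch === "/" && next === "/") {
      // Line comment: skip to the end of the line, keep the newline itself.
      while (i < src.length && src[i] !== "\n") i++;
    } else if (ch === "/" && next === "*") {
      // Block comment: skip past the closing "*/" (or to the end of input).
      const end = src.indexOf("*/", i + 2);
      i = end === -1 ? src.length : end + 2;
    } else if (ch === '"' || ch === "'" || ch === "`" ||
               (ch === "/" && regexCanStartAfter(prev))) {
      // Literal: copy verbatim up to the matching unescaped delimiter.
      const start = i;
      i++;
      while (i < src.length && src[i] !== ch) {
        if (src[i] === "\\") i++; // also skip the escaped character
        i++;
      }
      i++; // include the closing delimiter
      out += src.slice(start, i);
      prev = ch;
    } else {
      // Normal code: copy one character.
      out += ch;
      if (!/\s/.test(ch)) prev = ch;
      i++;
    }
  }
  return out;
}
```

Calling stripComments on the line var stName = "MyName"; //I WANT THIS COMMENT TO BE REMOVED returns the statement without the comment. The space in front of the removed comment stays, so you may want to trim trailing whitespace per line afterwards.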
Upvotes: 2