user3754804

Reputation: 33

How to strip comments from JavaScript using PHP

I want to remove the comments from this kind of script:

var stName = "MyName"; //I WANT THIS COMMENT TO BE REMOVED
var stLink = "http://domain.com/mydomain";
var stCountry = "United State of America";

What is the best way to accomplish this using PHP?

Upvotes: 0

Views: 659

Answers (2)

Niklaus

Reputation: 991

I would go with preg_replace(). Assuming all comments are single-line comments (// Comment here), you can start with this:

$JsCode = 'var stName = "MyName isn\'t \"Foobar\""; //I WANT THIS COMMENT TO BE REMOVED
var stLink = "http://domain.com/mydomain"; // Comment
var stLink2 = \'http://domain.com/mydomain\'; // This comment goes as well
var stCountry = "United State of America"; // Comment here';

$RegEx = '/(["\']((?>[^"\']+)|(?R))*?(?<!\\\\)["\'])(.*?)\/\/.*$/m';
echo preg_replace($RegEx, '$1$3', $JsCode);

Output:

var stName = "MyName isn't \"Foobar\""; 
var stLink = "http://domain.com/mydomain"; 
var stLink2 = 'http://domain.com/mydomain'; 
var stCountry = "United State of America"; 

This solution is far from perfect and might have issues with strings containing "//" in them.
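One limitation can be seen from the pattern itself: it starts by matching a quote character, so a comment on a line without any string literal is never touched. A minimal sketch, assuming a made-up input line (the variable $js and its contents are illustrative, not from the question):

```php
<?php
// Pattern copied from the answer above.
$RegEx = '/(["\']((?>[^"\']+)|(?R))*?(?<!\\\\)["\'])(.*?)\/\/.*$/m';

// Hypothetical input: a line with a comment but no quoted string.
$js = 'var count = 1; // this comment survives';

// The pattern needs a quote to start matching, so nothing is replaced.
$result = preg_replace($RegEx, '$1$3', $js);
var_dump($result === $js);
```

If inputs like this matter, the pattern would need an alternative branch for lines without strings, which is where the regex approach starts to get unwieldy.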

Upvotes: 0

Siguza

Reputation: 23850

The best way is to use an actual parser, or at least to write a lexer yourself.
The problem with regex is that it becomes enormously complex once you account for everything you have to.
For example, Cagatay Ulubay's suggested regexes /\/\/[^\n]?/ and /\/\*(.*)\*\// will match comments, but they will also match a lot more, like

var a = '/* the contents of this string will be matched */';
var b = '// and here you will even get a syntax error, because the entire rest of the line is removed';
var c = 'and actually, the regex that matches multiline comments will span across lines, removing everything between the first "/*" and here: */';
/*
   this comment, however, will not be matched.
*/
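The first two cases can be reproduced with a short sketch. The input in $js is made up for illustration, and the line-comment pattern is written with [^\n]* on the assumption that it was meant to consume the rest of the line; as quoted, with ?, it would only remove one character after the slashes.

```php
<?php
// Hypothetical input: two string literals that merely look like comments.
$js = "var a = '/* kept? */';\nvar b = '// kept?';";

// Block-comment pattern as quoted above: empties the string on the first line.
$noBlock = preg_replace('/\/\*(.*)\*\//', '', $js);

// Line-comment pattern, assuming [^\n]* was intended: cuts the second line
// off in the middle of the string literal, leaving invalid JavaScript.
$noLine = preg_replace('/\/\/[^\n]*/', '', $js);

echo $noBlock, "\n---\n", $noLine, "\n";
```

Neither pattern has any notion of being inside a string, which is the underlying problem.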

While it is rather unlikely that strings contain such sequences, the problem is real with inline regex:

var regex = /^something.*/; // You see the fake "*/" here?

The current scope matters a lot, and you can't possibly know the current scope unless you parse the script from the beginning, character by character.
So you essentially need to build a lexer.
You need to split the code into three different sections:

  • Normal code, which you need to output again, and where the start of a comment could be just one character away.
  • Comments, which you discard.
  • Literals, which you also need to output, but where a comment cannot start.

Now the only literals I can think of are strings (single- and double-quoted), inline regex and template strings (backticks), but those might not be all.
And of course you also have to take escape sequences inside those literals into account, because you might encounter an inline regex like

/^file:\/\/\/*.+/

in which a single-character based lexer would only see the regex /^file:\/ and incorrectly parse the following /*.+ as the start of a multiline comment.
Therefore, upon encountering the second /, you have to look back and check whether the last character you passed was a \. Since that backslash could itself be escaped (as in \\/), it is more robust to skip the next character whenever you encounter a \ inside a literal. The same goes for all kinds of quotes for strings.
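The three sections described above could be sketched in PHP roughly like this. The function name stripJsComments is hypothetical, and this is an illustration under stated assumptions rather than a complete lexer: a / is treated as the start of a regex only when the previous significant character is one of a small set of operators (a heuristic), and template strings containing ${ ... } expressions get no special handling.

```php
<?php
// Hypothetical helper: removes // and /* */ comments from JavaScript source.
function stripJsComments(string $src): string
{
    $out = '';
    $len = strlen($src);
    $prev = '';   // last non-whitespace character seen in normal code
    $i = 0;

    while ($i < $len) {
        $c = $src[$i];
        $next = $i + 1 < $len ? $src[$i + 1] : '';

        // Comments: discard them.
        if ($c === '/' && $next === '/') {
            // Line comment: skip up to (not including) the newline.
            while ($i < $len && $src[$i] !== "\n") {
                $i++;
            }
            continue;
        }
        if ($c === '/' && $next === '*') {
            $end = strpos($src, '*/', $i + 2);
            $i = $end === false ? $len : $end + 2;
            continue;
        }

        // Literals: copy verbatim up to the matching unescaped delimiter.
        $isString = $c === '"' || $c === "'" || $c === '`';
        // Heuristic (assumption): "/" opens a regex only after these characters.
        $isRegex = $c === '/'
            && ($prev === '' || strpos('(,=:[!&|?{};', $prev) !== false);
        if ($isString || $isRegex) {
            $start = $i;
            $inClass = false;   // inside [...] a "/" does not end the regex
            $i++;
            while ($i < $len) {
                $d = $src[$i];
                if ($d === '\\') {
                    $i += 2;    // skip the backslash and the escaped character
                    continue;
                }
                if ($isRegex && $d === '[') {
                    $inClass = true;
                } elseif ($isRegex && $d === ']') {
                    $inClass = false;
                } elseif ($d === $c && !$inClass) {
                    $i++;       // include the closing delimiter
                    break;
                } elseif ($d === "\n" && $c !== '`') {
                    break;      // unterminated literal: give up at end of line
                }
                $i++;
            }
            $out .= substr($src, $start, $i - $start);
            $prev = $c;         // a literal just ended, so a following "/" is division
            continue;
        }

        // Normal code: copy it and remember the last significant character.
        $out .= $c;
        if (trim($c) !== '') {
            $prev = $c;
        }
        $i++;
    }

    return $out;
}

echo stripJsComments('var r = /^file:\/\/\/*.+/; // the /* inside the regex is kept'), "\n";
```

Skipping the character after every backslash, instead of looking back one character, also copes with escaped backslashes such as \\ directly before a delimiter.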

Upvotes: 2
