Ken

Reputation: 1665

Is it possible to write a regex which checks if a string (javascript & php code) is minified?

Is it possible to write a regular expression which checks if a string (some code) is minified?

Many PHP/JS obfuscators remove white space chars (among other things). So, the final minified code sometimes looks like this:

PHP:
$a=array();if(is_array($a)){echo'ok';}

JS:
a=[];if(typeof(a)=='object'&&(a instanceof Array)){alert('ok')}

In both cases there are no space chars before or after "{", "}", ";", etc. There are also some other patterns which can help. I am not expecting a high-accuracy regex; I just need one which checks whether at least 100 chars of a string look like minified code. Thanks in advance.

PURPOSES: web malware scanner

Upvotes: 0

Views: 303

Answers (5)

Chris Dennett

Reputation: 22721

Run it through a parser for that particular language (even a prettifier might work fine) and modify it to count the number of unused characters. Use the percentage of unused chars vs. number of chars in documents as a test for minification. I don't think you can do this accurately with regex, although counting whitespace vs. document content might be okay.
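As a rough illustration of the simpler fallback mentioned above (counting whitespace against document content), here is a minimal sketch. It does not use a parser, and the function names and the 5% threshold are assumptions made for this example, not values taken from any real minifier.

```javascript
// Fraction of characters in `code` that are whitespace (0 for an empty string).
function whitespaceRatio(code) {
  if (code.length === 0) return 0;
  const matches = code.match(/\s/g);
  return (matches ? matches.length : 0) / code.length;
}

// Hypothetical threshold: treat code with under 5% whitespace as
// "looks minified". The right cut-off would have to be tuned on real samples.
function looksMinifiedByWhitespace(code, threshold = 0.05) {
  return code.length > 0 && whitespaceRatio(code) < threshold;
}
```

A hand-formatted snippet with indentation and spaces around operators will usually sit well above that ratio, while the one-line samples from the question contain no whitespace at all.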

Upvotes: 0

kapex

Reputation: 29969

You can't tell whether it was minified or just written like that by hand (this probably only applies to smaller scripts). But you can check whether it contains unnecessary whitespace.

Take a look at an open source obfuscator/minifier and see what rules it uses to remove whitespace. Validating whether those rules were applied should work; if the regex gets too complex, a simple parser might be needed.

Just make sure that string literals like a="if ( b )" are excluded.
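One way this could look in practice is sketched below: string literals are blanked out first, then whitespace sitting directly next to punctuation is counted. The helper names and the punctuation set are assumptions for illustration, and the string pattern is deliberately simple (it ignores comments, regex literals and template strings).

```javascript
// Replace quoted string literals with an empty placeholder so that spaces
// inside them (e.g. a="if ( b )") are not counted.
function stripStringLiterals(code) {
  return code.replace(/"(?:[^"\\]|\\.)*"|'(?:[^'\\]|\\.)*'/g, '""');
}

// Count runs of whitespace that sit directly before or after punctuation
// which a minifier would normally leave unpadded.
function countPaddedPunctuation(code) {
  const stripped = stripStringLiterals(code);
  const matches = stripped.match(/\s+[{}();,=]|[{}();,=]\s+/g);
  return matches ? matches.length : 0;
}
```

A count of zero over a long stretch of code would suggest that no optional whitespace is left, although it still cannot distinguish minified code from code written that way by hand.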

Upvotes: 0

Jonathan Hall

Reputation: 79604

The short answer is "no", regex cannot do this.

Your best bet will probably be to do a statistical analysis of the source files, and compare against some known heuristics. For instance, by comparing the variable names against those often found in minimized code. A minimized file probably has a lot of one-character variable names, for instance... and won't have two-character variable names until all the one-character variable names are exhausted... etc.
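A minimal sketch of the variable-name heuristic might look like the following. The keyword list is partial and the function name is made up for this example; the tokenizer is a plain regex, so words inside string literals are counted as well.

```javascript
// Partial list of words that are language keywords or built-in constructs
// rather than author-chosen names (an assumption, not an exhaustive list).
const KEYWORDS = new Set([
  'if', 'else', 'for', 'while', 'do', 'var', 'let', 'const', 'function',
  'return', 'new', 'typeof', 'instanceof', 'echo', 'array', 'in', 'of'
]);

// Share of identifier-like words that are a single character long.
// A leading "$" (PHP variables) is ignored when measuring the length.
function shortIdentifierRatio(code) {
  const words = code.match(/[A-Za-z_$][A-Za-z0-9_$]*/g) || [];
  const names = words.filter((w) => !KEYWORDS.has(w));
  if (names.length === 0) return 0;
  const short = names.filter((w) => w.replace(/^\$/, '').length <= 1).length;
  return short / names.length;
}
```

The ratio only becomes meaningful on longer inputs, and it would need to be compared against values measured on known minified and known hand-written files.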

Another option would be simply to run the source file through a minimizer, and see if the output is sufficiently different from the input. If not, it was probably already minimized.

But I have to agree with sg3s's final sentence: If you can explain why you need this, we can probably provide more useful answers to your actual needs.

Upvotes: 2

Paul

Reputation: 141827

I think a minifier will strip all newline characters, although there might possibly be one at the end of the file still if the minified code was pasted back in a text editor. Something like this will probably be fairly accurate:

/^[^\n\r]+(\r\n?|\n)?$/

That just tests that there are no newline characters in the whole thing except for possibly one at the end. So no guarantees, but I think it will work well on any longish block of code.
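For illustration, this is how the pattern could be applied in JavaScript. The wrapper functions and the `minLength` parameter are assumptions added here; the default of 100 comes from the "at least 100 chars" requirement in the question.

```javascript
// The pattern from above: no line breaks except an optional trailing one.
const SINGLE_LINE = /^[^\n\r]+(\r\n?|\n)?$/;

function isSingleLine(code) {
  return SINGLE_LINE.test(code);
}

// Hypothetical wrapper: only flag blocks that are long enough to be
// meaningful, since any short one-liner would pass the regex.
function looksMinifiedSingleLine(code, minLength = 100) {
  return code.length >= minLength && isSingleLine(code);
}
```

Both samples from the question pass `isSingleLine`, but they are shorter than 100 characters, so `looksMinifiedSingleLine` would not flag them with the default length.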

Upvotes: 2

sg3s

Reputation: 9567

No. The syntax of the code and its intention don't change, and some people who are very familiar with PHP and/or JS will write simple functions on one line without any whitespace at all (me :s).

What you could do is count all the whitespace characters in a string, though this would also be unreliable, since for some things you simply need whitespace, like x instanceof y heh. Also, not all minified code is crammed into a single line (see jQuery UI), so you can't really count on that either....

Maybe you can explain why you need to know this and we can try and find an alternative?

Upvotes: 0
