Reputation: 3455
I am looking for a way, using static analysis of two JavaScript functions, to tell if they are the same. Let me give several definitions of "the same".
Level 1: The functions are the same except for possible different whitespace, e.g. TABS, CR, LF and SPACES.
Level 2: The functions may have different whitespace like Level 1, but also may have different variable names.
Level 3: ???
For level one, I think I could just remove all (non-literal, which may be tough) whitespace from each string containing the two JS function definitions, and then compare the strings.
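A minimal sketch of that Level 1 idea, assuming the source contains no comments or regex literals (each would need its own state in the scanner) and treating template literals like ordinary strings. The function names are made up for illustration. Note that dropping all whitespace can merge tokens (a + +b and a ++b collapse to the same string), so this is an approximation rather than a proof of equality.

```javascript
// Remove whitespace that is not inside a string literal.
// Sketch only: comments and regex literals are NOT recognized.
function stripNonLiteralWhitespace(src) {
  let out = "";
  let quote = null; // quote character of the literal we are inside, or null
  for (let i = 0; i < src.length; i++) {
    const ch = src[i];
    if (quote !== null) {
      out += ch;
      if (ch === "\\" && i + 1 < src.length) {
        out += src[++i]; // copy the escaped character verbatim
      } else if (ch === quote) {
        quote = null; // closing quote reached
      }
    } else if (ch === '"' || ch === "'" || ch === "`") {
      quote = ch;
      out += ch;
    } else if (!/\s/.test(ch)) {
      out += ch; // keep everything except whitespace
    }
  }
  return out;
}

// Level 1 comparison: equal once non-literal whitespace is gone.
function sameLevel1(a, b) {
  return stripNonLiteralWhitespace(a) === stripNonLiteralWhitespace(b);
}
```

Whitespace inside a string literal still counts, so "a  b" and "a b" are reported as different.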
For level two, I think I would need to use something like SpiderMonkey's parser to generate two parse trees, and then write a comparer which walks the trees and allows variables to have different names.
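A rough sketch of such a comparer, assuming the parser hands back ESTree-style plain objects (nodes with a type field, identifiers as { type: "Identifier", name }). The name sameUpToRenaming and the list of position fields are my own assumptions. It keeps a single one-to-one name mapping for the whole tree, so it ignores scoping and shadowing, and it only special-cases obj.prop member access. Object literal keys and labels would need similar treatment.

```javascript
// Fields that carry source positions rather than structure (names vary by parser).
const POSITION_KEYS = new Set(["loc", "range", "start", "end"]);

function sameUpToRenaming(a, b, fwd = new Map(), rev = new Map()) {
  if (Array.isArray(a) || Array.isArray(b)) {
    return Array.isArray(a) && Array.isArray(b) && a.length === b.length &&
      a.every((item, i) => sameUpToRenaming(item, b[i], fwd, rev));
  }
  if (a === null || b === null || typeof a !== "object" || typeof b !== "object") {
    return a === b; // primitives: node types, operators, literal values
  }
  if (a.type === "Identifier" && b.type === "Identifier") {
    // Require a consistent one-to-one mapping between the two sets of names.
    if (!fwd.has(a.name) && !rev.has(b.name)) {
      fwd.set(a.name, b.name);
      rev.set(b.name, a.name);
    }
    return fwd.get(a.name) === b.name && rev.get(b.name) === a.name;
  }
  if (a.type === "MemberExpression" && b.type === "MemberExpression" &&
      !a.computed && !b.computed) {
    // obj.prop: the property name is not a variable, so it must match exactly.
    return a.property.name === b.property.name &&
      sameUpToRenaming(a.object, b.object, fwd, rev);
  }
  const keysA = Object.keys(a).filter((k) => !POSITION_KEYS.has(k)).sort();
  const keysB = Object.keys(b).filter((k) => !POSITION_KEYS.has(k)).sort();
  return keysA.length === keysB.length &&
    keysA.every((k, i) => k === keysB[i] && sameUpToRenaming(a[k], b[k], fwd, rev));
}

// Hand-built fragments for `return p + q`, standing in for real parser output.
const id = (name) => ({ type: "Identifier", name });
const sum = (p, q) => ({
  type: "ReturnStatement",
  argument: { type: "BinaryExpression", operator: "+", left: id(p), right: id(q) },
});
console.log(sameUpToRenaming(sum("a", "b"), sum("x", "y"))); // true
console.log(sameUpToRenaming(sum("a", "a"), sum("x", "y"))); // false
```

The reverse map is what rejects two different names on one side collapsing into a single name on the other.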
[Edit] Williham makes a good point below. I do mean identical. Now I'm looking for some practical strategies, particularly with regard to using parse trees.
Upvotes: 3
Views: 457
Reputation: 95392
See my company's (Semantic Designs) Smart Differencer tool. This family of tools parses source code according to compiler-level-detail grammar for the language of interest (in your case, JavaScript), builds ASTs, and then compares the ASTs (which effectively ignores whitespace and comments). Literal values are normalized, so it doesn't matter how they are "spelled"; 10E17 has the same normalized value as 1E18.
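To illustrate the literal normalization point only (this is not how Smart Differencer works internally, which I can't speak to), here is a small sketch that compares numeric literals by value instead of by spelling. The helper name is hypothetical, and it assumes plain decimal, hex, octal or binary literals. BigInt literals and numeric separators are not handled, since Number() does not parse them.

```javascript
// Map a numeric literal's source text to a canonical spelling of its value.
// Assumption: `raw` is a plain numeric literal such as "10E17", "0x10" or "1.50".
function normalizeNumericLiteral(raw) {
  return String(Number(raw));
}

// Two spellings of the same value normalize to the same text.
console.log(normalizeNumericLiteral("10E17") === normalizeNumericLiteral("1E18")); // true
console.log(normalizeNumericLiteral("0x10") === normalizeNumericLiteral("16"));   // true
```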
If the two trees are the same, it will tell you "no differences". If they differ by a consistent renaming of an identifier, the tool will tell you the consistent renaming and the block in which it occurs. Other differences are reported as language-element (identifier, statement, block, class,...) insertions, deletions, copies, or moves. The goal is to report the small set of deltas that plausibly explain the differences. You can see examples for a number of languages at the web site.
You can't in practice go much beyond this; to determine if two functions compute the same answer, in principle you have to solve the halting problem. You might be able to detect where two language elements that belong to a commutative list can be commuted without effect; we're working on this. You might be able to apply normalization rewrites to canonicalize certain forms (e.g., map all multiple declarations into a sequence of lexically sorted single declarations). You might be able to convert the source code into its equivalent set of dataflows, and do a graph isomorphism match (the Programmer's Apprentice from MIT back in the 1980s proposed to do this, but I don't think they ever got there).
All of these are more work than you might expect.
Upvotes: 1
Reputation: 29019
Reedit:
To expound on my suggestion for determining identical functions, the following flow can be suggested:
Level 1: Remove any whitespace that is not part of a string literal; insert newlines after each {, ; and } and compare. If equal, the functions are identical; if not:
Level 2: Move all variable declarations and assignments that don't depend on the state of other variables defined in the same scope to the start of the scope they are declared in (or, if you don't want to actually parse the JS, the start of the braces), and order them by line length, treating all variable names as being 4 characters long and falling back to alphabetical ordering (ignoring variable names) in case of tied lengths. Reorder all collections in alphabetical order, and rename all variables vSNN, where v is literal, S is the number of nested braces and NN is the order in which the variable was encountered. Compare; if equal, the functions are identical; if not:
Level 3: Replace all string literals with "sNN", where " and s are literal, and NN is the order in which the string was encountered. Compare; if equal, the functions are identical, if not:
Level 4: Normalize the names of any functions known to be the same by using the name of the function with the highest priority according to alphabetical order (in the example below, any calls to p_strlen() would be replaced with c_strlen()). Repeat re-orderings as per level 1 if necessary. Compare; if equal, the functions are identical; if not, the functions are almost certainly not identical.
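As one concrete piece of the flow above, the Level 3 step could be sketched like this. It is regex-based and assumes single- and double-quoted literals only (no template literals, no quotes hidden in comments or regex literals). Numbering from 01 and giving every occurrence its own number are my assumptions, and the function name is made up.

```javascript
// Replace each quoted string literal with "sNN", numbered in order of appearance.
// Sketch only: template literals, comments and regex literals are not recognized.
function replaceStringLiterals(src) {
  let counter = 0;
  const literal = /"(?:[^"\\\n]|\\.)*"|'(?:[^'\\\n]|\\.)*'/g;
  return src.replace(literal, () => {
    counter += 1;
    return '"s' + String(counter).padStart(2, "0") + '"';
  });
}

// Two calls that differ only in their string contents and quote style now compare equal.
console.log(replaceStringLiterals("log('hi', \"it's\")")); // log("s01", "s02")
console.log(replaceStringLiterals('f("a")') === replaceStringLiterals("f('b')")); // true
```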
Original answer:
I think you'll find that you mean "identical", not "the same".
The difference, as you'll find, is critical:
Two functions are identical if, following some manner of normalization, (removing non-literal whitespace, renaming and reordering variables to a normalized order, replacing string literals with placeholders, …) they compare to literally equal.
Two functions are the same if, when called for any given input value they give the same return value. Consider, in the general case, a programming language which has counted, zero-terminated strings (hybrid Pascal/C strings, if you will). A function p_strlen(str) might look at the character count of the string and return that. A function c_strlen(str) might count the number of characters in the string and return that.
While these functions certainly won't be identical, they will be the same: For any given (valid) input value they will give the same value.
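To make the distinction concrete, here is a toy version in JavaScript. The counted, zero-terminated string is modelled as a plain object, which is purely my assumption for illustration (as is the helper makeCountedString); "valid" input here means the text itself contains no "\0".

```javascript
// Hypothetical hybrid Pascal/C string: a stored count plus a "\0" terminator.
function makeCountedString(text) {
  const chars = Array.from(text);
  return { count: chars.length, chars: [...chars, "\0"] };
}

// Pascal style: trust the stored count.
function p_strlen(str) {
  return str.count;
}

// C style: walk the characters until the terminator.
function c_strlen(str) {
  let n = 0;
  while (str.chars[n] !== "\0") n++;
  return n;
}

const hello = makeCountedString("hello");
console.log(p_strlen(hello) === c_strlen(hello));            // true: same result
console.log(p_strlen.toString() === c_strlen.toString());    // false: not identical text
```

No amount of whitespace stripping or renaming makes the two bodies compare equal, even though they agree on every valid input.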
My point is:
Determining whether two functions are identical (what you seem to want to achieve) is a (moderately) trivial problem, done as you describe.
Determining whether two functions are truly the same (what you might actually want to achieve) is non-trivial; in fact, it's downright Hard, probably related to the Halting Problem, and not something that can be done with static analysis.
Edit: Of course, functions that are identical are also the same; but in a highly specific and rarely useful way for complete analysis.
Upvotes: 3
Reputation: 8160
Your approach for level 1 seems reasonable.
For level 2, how about doing some rudimentary variable substitution on each function and then applying the level 1 approach? Start at the beginning and rename each variable declaration you encounter to var1, var2, ... varX.
This does not help if the functions declare variables in different orders... var i and var j may be used the same way in both functions but are declared in different orders. Then you are probably left doing a comparison of parse trees like you mention.
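The substitution and its weak spot can be sketched with a crude textual renamer. This is an illustration under assumptions: it only sees single `var name` declarations (not `var a, b`, let, const or parameters), it skips names after a dot, and it will also rewrite matching words inside strings. The function name is made up.

```javascript
// Rename declared variables to var1, var2, ... in order of declaration.
// Sketch only: purely textual, no parsing or scope analysis.
function renameDeclaredVars(src) {
  const table = new Map();
  const declaration = /\bvar\s+([A-Za-z_$][\w$]*)/g;
  let match;
  while ((match = declaration.exec(src)) !== null) {
    if (!table.has(match[1])) {
      table.set(match[1], "var" + (table.size + 1));
    }
  }
  // Single pass over identifier-like words so replacements cannot cascade.
  return src.replace(/(?<![\w$.])[A-Za-z_$][\w$]*/g, (word) =>
    table.has(word) ? table.get(word) : word
  );
}

const first = "function f(){ var i = 0; var j = 1; return i - j; }";
const renamed = "function f(){ var a = 0; var b = 1; return a - b; }";
const swapped = "function f(){ var j = 1; var i = 0; return i - j; }";

console.log(renameDeclaredVars(first) === renameDeclaredVars(renamed)); // true
console.log(renameDeclaredVars(first) === renameDeclaredVars(swapped)); // false
```

The last pair computes the same thing, but the swapped declaration order changes which name becomes var1, so the strings no longer match.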
Upvotes: 1