Finding similar code sections using sub-trees

Question

I have been reading this paper titled Clone Detection using Abstract Syntax Trees by Ira D. Baxter et al. There is a paragraph from the paper that I reproduced below:

In principle, finding sub-tree clones is easy: compare every subtree to every other sub-tree for equality. In practice, several problems arise: near-miss clone detection, sub-clones and scale. ...

When locating near-miss clones, hashing on complete subtrees fails precisely because good hashing functions include all elements of the tree, and thus sorts tress with minor differences into different buckets. We solved this problem by choosing an artificially bad hash function. This function must be characterized in such a way that the main properties one wants to find on near-miss clones are preserved. Near miss clones are usually created by copy and paste procedures followed by small modifications. These modifications usually generate small changes to the shape of the tree associated with the copied piece of code. Therefore, we argue that this kind of near-miss clone often have only some different small sub-trees. Based on this observation, a hash function that ignores small sub-trees is a goodchoice. In the experiment presented here, we used a hash function that ignores only the identifier names (leaves in the tree). Thus our hashing function puts trees which are similar modulo identifiers into the same hash bins for comparison.

I am trying to implement the techniques discussed in this paper but am stuck in trying to understand this one paragraph (that is unfortunately at the beginning of the paper). I understand what the paragraph is saying but the authors do not mention what hash function to choose or how to actually hash the ASTs. Can someone please explain this with a simple example from an implementation standpoint?

Ira Baxter · Accepted Answer

Shades that the author himself should answer. Isn't StackOverflow great :-?

The point of hash functions is that which one you choose doesn't matter, as long as it distributes input values evenly across a large number of buckets. You need a hash function that can be applied to the entire tree; the usual technique for such is to serialize the tree in any way possible (say, by an in-order tree visit) and then apply the hash function to the stream of values (tree nodes) this produces. (This idea is from the compiler literature on detecting common subexpressions, which was the inspiration for the original CloneDR). If this isn't clear, you need to spend more energy understanding how hash functions are applied to complex data structures. Wikipedia on hashing is a good place to start; if that's not enough, you need to find a book on algorithms and study up.

What you feed to the hash function is up to you. The point I made in the paper is that you can compute hash functions that ignore the identifier leaves of an AST, which will cause trees having the same identical structure but different identifiers to hash to the same bucket. Thus, trees which are similar modulo identifiers are easily matched, beause they occur in the same hash bucket.

Of course, there's a lot more to the whole clone detection algorithm that just matching trees modulo identifiers. You need to worry about matching parameterized sequences (which is sort of the big point in the paper), reporting the results, and of course you need a high-quality language parser for whatever language you care to apply this to.

You can see results of the CloneDR for a number of different languages.

Finding similar code sections using sub-trees

Answers (2)

Related Questions