Reputation: 41
I am finding difficulties to understand
1) AST matching, how two AST's are similar? Are types included in the comparison/matching or only the operations like +, -, ++,...etc inlcuded?
2) Two statements are syntactically similar (This term I read it somewhere in a paper), can we say the below example that the two statement are syntactically similar?
int x = 1 + 2
String y = "1" + "2"
Java - Eclipse is what am using right now and trying to understand the AST for.
Best Regards,
Upvotes: 1
Views: 970
Reputation: 95334
What ASTs are:
An AST is a data structure representing program source text that consists of nodes that contain a node type, and possibly a literal value, and a list of children nodes. The node type corresponds to what OP calls "operations" (+, -, ...) but also includes language commands (do, if, assignment, call, ...) , declarations (int, struct, ...) and literals (number, string, boolean). [It is unclear what OP means by "type"]. (ASTs often have additional information in each node referring back to the point of origin in the source text file).
What ASTs are for:
OP seems puzzled by the presence of ASTs in Eclipse.
ASTs are used to represent program text in a form which is easier to interpret than the raw text. They provide a means to reason about the program structure or content; sometimes they are used to enable the modification of program ("refactoring") by modifying the AST for a program and then regenerating the text from the AST.
Comparing ASTs for similarity is not a really common use in my experience, except in clone detection and/or pattern matching.
Comparing ASTs:
Comparing ASTs for equality is easy: compare the root node type/literal value for equality; if not equal, the comparision is complete, else (recursively) compare the children nodes).
Comparing ASTs of similarity is harder because you must decide how to relax the equality comparision. In particular, you must decide on a precise definition of similarity. There are many ways to define this, some rather shallow syntactically, some more semantically sophisticated.
My paper Clone Detection Using Abstract Syntax Trees describes one way to do this, using similarity defined as the ratio of the number of nodes shared divided by the number of nodes total in both trees. Shared nodes are computed by comparing the trees top down to the point where some child is different. (The actual comparision is to compute an anti-unifier). This similary measure is rather shallow, but it works remarkably well in finding code clones in big software systems.
From that perspective, OPs's examples:
int x = 1 + 2
String y = "1" + "2"
have trees written as S-expressions:
(declaration_with_assignment (int x) (+ 1 2))
(declaration_with_assignment (String y) (+ "1" "2"))
These are not very similar; they only share a root node whose type is "declaration-with-assignment" and the top of the + subtree. (Here the node count is 12 with only 2 matching nodes for a similarity of 2/12).
These would be more similar:
int x = 1 + 2
float x = 1.0 + 2
(S-expressions)
(declaration_with_assignment (int x) (+ 1 2))
(declaration_with_assignment (float x) (+ 1.0 2))
which share the declaration with assignment, the add node, the literal leaf node 2, and arguably the literal nodes for integer 1 and float 1.0, depending on whether you wish to define them as "equal" or not, for a similarity of 4/12.
If you change one of the trees to be a pattern tree, in which some "leaves" are pattern variables, you can then use such pattern trees to find code that has certain structure.
The surface syntax pattern:
?type ?variable = 1 + ?expression
with S-expression
((declaration_with_assignment (?type ?varaible)) (+ 1 ?expression))
matches the first of OP's examples but not the second.
As far as I know, Eclipse doesn't offer any pattern-based matching abilities. But these are very useful in program analysis and/or program transformation tools. For some specific examples, too long to include here, see DMS Rewrite Rules
(Full disclosure: DMS is a product of my company. I'm the architect).
Upvotes: 4