Spark machine learning AST comparison

Question

I'm new to machine learning, but I'm trying to find out whether it's possible to use machine learning to compare the ASTs (abstract syntax trees) of two pieces of source code and determine whether they are similar.

Ideally, I would do some training on a dataset and then compare any two given ASTs to find the similarities.

Any suggestions here?

WestCoastProjects · Accepted Answer

It seems you were expecting that a machine learning algorithm would do the heavy lifting of discovering the relative "distance" between two ASTs. That is unlikely. Instead, you might first consider the overall structure of the two trees: do they have similar numbers of nodes at each level of the tree? If they do, for a significant majority of the tree at least, then you might wish to define one of two approaches for "distance metrics" between the two trees:

number of different node values
relative difference of the node values: maybe a traditional Levenshtein distance, but more likely a comparison that understands the semantics of the particular language well enough to see how similar they are. For example, it could recognize that two structures represent the same statement despite unimportant whitespace or other formatting differences, or that they use different variable names but have identical semantics.
An additional check may be to count how many subtrees are identical. Then, for the diverging subtrees, define a spatial metric able to find structural similarities and differences only within that subtree.
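To make the checks above concrete, here is a minimal sketch using Python's standard `ast` module. It is only an illustration under assumptions of mine: the question does not say which language the ASTs come from, the helper names (`level_profile`, `subtree_keys`, `shared_subtree_ratio`) are hypothetical, and treating two subtrees as "identical" when their `ast.dump` strings match after renaming variables is one possible definition, not the only one.

```python
import ast
from collections import Counter


def level_profile(tree):
    """Return the number of AST nodes at each depth, root first."""
    counts = []
    level = [tree]
    while level:
        counts.append(len(level))
        # Move one level down: gather the children of every node on this level.
        level = [child for node in level for child in ast.iter_child_nodes(node)]
    return counts


class _NormalizeNames(ast.NodeTransformer):
    """Replace every variable name with a placeholder so renames do not matter.

    Assumption: only ast.Name nodes are normalized; function names,
    argument names and attribute names are left untouched in this sketch.
    """

    def visit_Name(self, node):
        node.id = "_"
        return node


def subtree_keys(source):
    """Count subtrees by their dumped form, after name normalization."""
    tree = _NormalizeNames().visit(ast.parse(source))
    # ast.dump omits line/column attributes by default, so whitespace
    # and formatting differences do not affect the key.
    return Counter(ast.dump(node) for node in ast.walk(tree))


def shared_subtree_ratio(source_a, source_b):
    """Fraction of subtrees the two programs have in common (0.0 to 1.0)."""
    a, b = subtree_keys(source_a), subtree_keys(source_b)
    shared = sum((a & b).values())  # multiset intersection
    total = max(sum(a.values()), sum(b.values()))
    return shared / total
```

With this definition, `x = a + b` and `y   =   c + d` have the same level profile and a shared-subtree ratio of 1.0, because they differ only in variable names and spacing. Subtrees that do not match are the ones you would hand to a finer-grained metric.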

The summary is: "nothing out of the box for the entire problem - but you can leverage existing ideas/algorithms for particular localized cases".
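As one example of leveraging an existing algorithm for a localized case, the sketch below scores a single pair of code fragments by flattening each AST into a sequence of node type names and comparing the sequences with `difflib.SequenceMatcher` from the Python standard library. Note that this is difflib's longest-matching-block comparison, not Levenshtein distance, and that the function names and the choice of Python are my assumptions for illustration.

```python
import ast
import difflib


def preorder_types(node):
    """Yield node type names in pre-order (parent before its children)."""
    yield type(node).__name__
    for child in ast.iter_child_nodes(node):
        yield from preorder_types(child)


def sequence_similarity(source_a, source_b):
    """Similarity ratio in [0, 1] between two flattened ASTs."""
    seq_a = list(preorder_types(ast.parse(source_a)))
    seq_b = list(preorder_types(ast.parse(source_b)))
    # autojunk=False disables difflib's heuristic that ignores very
    # frequent items, which would otherwise skew long sequences.
    matcher = difflib.SequenceMatcher(None, seq_a, seq_b, autojunk=False)
    return matcher.ratio()
```

Because only node types are compared, identifier names and literal values are ignored entirely. That makes the score robust to renaming, but it also means `a + b` and `c + d` are indistinguishable, so it is best used alongside a check on node values.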
