Alex

Reputation: 370

Java source code attribute counting for detection

I am currently working on a source code plagiarism detection project, and I use various attributes of the input files (source code files) to detect plagiarism among student assignments. For example, I currently use the number of identifiers/variables, the number of methods used, the number of lines of code, and some other attributes to represent each source code file.

However, when I try to count the number of variables used, one problem is how to find out whether a variable has actually been used, because students could intentionally add unused identifiers to cover up the plagiarism. I have found this really tough to solve. One approach is to use regular expressions in Java to find identifiers, but after finding them I am stuck on how to check whether they are used. (What's more, after this I still need to find out whether a Java method is called or not.) So writing my own regular expressions for all of this could get very complicated.

I know that in some IDEs, such as NetBeans, the editor can instantly tell whether a variable is used and underline it. So I wonder whether there is a good way to check whether variables are used.

Any suggestions on how to check whether variables are used would be welcome!

Upvotes: 0

Views: 496

Answers (3)

flyx

Reputation: 39738

For doing this kind of code analysis, you absolutely have to look into parser / compiler tools. You cannot determine whether a variable is used by searching for its mere name; you have to search for correct context as well.

I suggest having a look at ANTLR, which is a Java-based language parsing tool. A definition for parsing Java syntax is available for it. Don't expect to find an easy solution to your problem that can be implemented in a couple of hours.

Another Java-based tool is JavaCC. If you're looking for example code showing how these tools can be used, take a look at PMD, which uses a parser built with JavaCC to analyze Java code.

Another possibility is to write a plugin for an IDE that supports code analysis - you'd probably have a much simpler interface there to access the code structure, and as you said, lots of functionality is already available and can simply be called by your plugin.

Yes, you can probably also hack your way through with some regexes. Whether you want to do this depends on how exact you want your tool to be. Without parsing the source code, deciding whether an occurrence of a variable name is actually a usage of that variable is merely a heuristic guess.
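To give an idea of what the parser route looks like, here is a minimal sketch. Note that it uses neither ANTLR nor JavaCC: it swaps in the Compiler Tree API that ships with the JDK (`com.sun.source`), simply because that needs no extra dependency. The class name `UnusedVariableSketch` and its helper are made up for illustration. It assumes a full JDK is available, and it compares names only (no scoping), so treat it as an approximation rather than a finished detector.

```java
import com.sun.source.tree.CompilationUnitTree;
import com.sun.source.tree.IdentifierTree;
import com.sun.source.tree.MemberSelectTree;
import com.sun.source.tree.VariableTree;
import com.sun.source.util.JavacTask;
import com.sun.source.util.TreeScanner;

import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import javax.tools.JavaCompiler;
import javax.tools.SimpleJavaFileObject;
import javax.tools.ToolProvider;

// Hypothetical helper class, not part of any library.
class UnusedVariableSketch {

    /** Wraps a source string so the compiler can read it without a file on disk. */
    static final class StringSource extends SimpleJavaFileObject {
        private final String code;

        StringSource(String fileName, String code) {
            super(URI.create("string:///" + fileName), Kind.SOURCE);
            this.code = code;
        }

        @Override
        public CharSequence getCharContent(boolean ignoreEncodingErrors) {
            return code;
        }
    }

    /** Names that are declared as variables but never appear as an identifier elsewhere. */
    static Set<String> declaredButNeverReferenced(String code) {
        // Returns null when only a JRE (no compiler) is installed; not handled in this sketch.
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
        JavacTask task = (JavacTask) compiler.getTask(
                null, null, null, null, null, List.of(new StringSource("Sample.java", code)));

        Set<String> declared = new TreeSet<>();
        Set<String> referenced = new TreeSet<>();

        TreeScanner<Void, Void> scanner = new TreeScanner<Void, Void>() {
            @Override
            public Void visitVariable(VariableTree node, Void p) {
                declared.add(node.getName().toString());      // fields, locals, parameters
                return super.visitVariable(node, p);
            }

            @Override
            public Void visitIdentifier(IdentifierTree node, Void p) {
                referenced.add(node.getName().toString());    // plain uses such as x (reads and assignment targets)
                return super.visitIdentifier(node, p);
            }

            @Override
            public Void visitMemberSelect(MemberSelectTree node, Void p) {
                referenced.add(node.getIdentifier().toString()); // qualified uses such as this.x
                return super.visitMemberSelect(node, p);
            }
        };

        try {
            // Parse only: no type checking, so the input need not compile against anything.
            for (CompilationUnitTree unit : task.parse()) {
                scanner.scan(unit, null);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }

        declared.removeAll(referenced);
        return declared;
    }

    public static void main(String[] args) {
        String src = "class Sample { int f(int a) { int used = a + 1; int unused = 2; return used; } }";
        System.out.println(declaredButNeverReferenced(src));
    }
}
```

Because the comparison is by name only, a variable that is merely assigned to (`x = 5;`) still counts as referenced, and two variables with the same name in different methods are conflated. Getting that right requires proper scope resolution, which is where the real work is.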

Upvotes: 1

TPete

Reputation: 2069

IDEs classify occurrences of a variable into two categories: assignments to that variable and simple usages of it. An assignment should be easy to recognize using a regex. All other occurrences should be code that just uses (reads) the variable.
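A rough sketch of that classification, assuming the code is scanned as plain text: count every whole-word occurrence of the name, then count those directly followed by a single `=`, and treat the remainder as reads. The class and method names (`OccurrenceClassifier`, `countWritesAndReads`) are hypothetical, and the sketch ignores comments, string literals and scoping.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration only.
class OccurrenceClassifier {

    /** Returns {writes, reads} for the given variable name in a chunk of code. */
    static int[] countWritesAndReads(String code, String name) {
        String quoted = Pattern.quote(name);
        // Any whole-word occurrence of the name.
        Pattern anyOccurrence = Pattern.compile("\\b" + quoted + "\\b");
        // The name followed by a single '=' (not '=='), i.e. a plain assignment target.
        Pattern assignment = Pattern.compile("\\b" + quoted + "\\b\\s*=(?!=)");

        int total = count(anyOccurrence, code);
        int writes = count(assignment, code);
        return new int[] {writes, total - writes};
    }

    private static int count(Pattern pattern, String text) {
        Matcher matcher = pattern.matcher(text);
        int n = 0;
        while (matcher.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        int[] result = countWritesAndReads("int x = 1; int y = x + 2; if (x == y) { x = y; }", "x");
        System.out.println("writes=" + result[0] + ", reads=" + result[1]); // writes=2, reads=2
    }
}
```

Compound assignments such as `x += 1` end up in the "reads" bucket here even though they also write, so a variable with zero reads is a candidate for "declared but never really used", not a proof.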

Upvotes: 0

npinti

Reputation: 52185

First thing that comes to mind is to do something like so:

(\w+)\s+(?<varname>\w+)\s*(=[\w\s(),]+)?;

This should match variable creation like so:

int x = 1;
double y;
Foo foo = new Foo(); 
Foo foo = new Foo(a,b,c);
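As a quick sanity check, the declaration pattern can be run over those four sample lines roughly like this. It is written with Java's named-group syntax `(?<varname>...)`, and the name is read back through `matcher.group("varname")`. The class name `DeclarationFinder` is just a placeholder for this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration only.
class DeclarationFinder {

    // Type, whitespace, variable name (captured as "varname"), optional simple initializer, semicolon.
    static final Pattern DECLARATION =
            Pattern.compile("(\\w+)\\s+(?<varname>\\w+)\\s*(=[\\w\\s(),]+)?;");

    /** Collects the name captured by every match of the declaration pattern, in order. */
    static List<String> declaredNames(String code) {
        List<String> names = new ArrayList<>();
        Matcher matcher = DECLARATION.matcher(code);
        while (matcher.find()) {
            names.add(matcher.group("varname"));
        }
        return names;
    }

    public static void main(String[] args) {
        String code = "int x = 1;\n"
                + "double y;\n"
                + "Foo foo = new Foo();\n"
                + "Foo foo = new Foo(a,b,c);";
        System.out.println(declaredNames(code)); // [x, y, foo, foo]
    }
}
```

Be aware that the pattern is only a heuristic: a statement such as `return x;` has the same shape as `double y;` and would be reported as a declaration of `x`, so keywords would need to be filtered out separately.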

To make things less complicated, it might be a good idea to replace every ; that is not between quotes with ;\n. This should make sure that you have at most one statement per line.
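One possible way to do that replacement is a small character scanner rather than another regex, since tracking "inside quotes" is easier with a flag. This is only a sketch: `StatementSplitter` is a made-up name, and comments, text blocks and `for (;;)` headers are deliberately not handled.

```java
// Hypothetical helper class for illustration only.
class StatementSplitter {

    /** Appends a newline after every ';' that is not inside a string or char literal. */
    static String breakAfterSemicolons(String code) {
        StringBuilder out = new StringBuilder();
        boolean inQuote = false;
        char quote = 0; // the quote character that opened the current literal

        for (int i = 0; i < code.length(); i++) {
            char c = code.charAt(i);
            out.append(c);
            if (inQuote) {
                if (c == '\\' && i + 1 < code.length()) {
                    out.append(code.charAt(++i)); // copy the escaped character verbatim
                } else if (c == quote) {
                    inQuote = false;              // literal closed
                }
            } else if (c == '"' || c == '\'') {
                inQuote = true;                   // literal opened
                quote = c;
            } else if (c == ';') {
                out.append('\n');                 // statement boundary
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(breakAfterSemicolons("String s = \"a;b\"; int x = 1;"));
    }
}
```

The semicolon inside `"a;b"` is left alone, while the two statement-ending semicolons each get a line break after them.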

The regex provided, besides trying to match variable creation, also puts the name of the variable in a group named varname, which you can access through a matcher object like so: String varName = matcher.group("varname");. To see whether a variable is being used, you can then check whether it appears on the right-hand side of an assignment, like so:

[^=]+\s*=\s*.*?x.*;

This should match strings such as int y = x; and Foo foo = x + y;

However, a variable can also be used as a method parameter, so you can do something like so:

.*?\(.*?x.*?\).*?;

This will match strings like so:

foo(x);
foo(a,b,c,x);
Foo foo = new Foo(a,v,x,y).createNewFoo();
Foo foo = new Foo(a,v,x,y).SOMECONSTANT;

Note that in the regular expressions provided, x is just a sample variable name. It should be replaced with the actual variable name, which you can extract using the first regular expression.
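Substituting the real name into the two usage patterns could look roughly like the sketch below. Two things in it are my own additions rather than part of the patterns as written: the name is passed through `Pattern.quote`, and it is wrapped in `\b` word boundaries so that looking for `x` does not also hit `max` or `xs`. The class name `UsageCheck` is hypothetical, and the input is assumed to be a single statement.

```java
import java.util.regex.Pattern;

// Hypothetical helper class for illustration only.
class UsageCheck {

    /** True if the name appears after an '=' or inside parentheses in the statement. */
    static boolean isUsed(String statement, String name) {
        // Assumption: whole-word matching, added on top of the original patterns.
        String n = "\\b" + Pattern.quote(name) + "\\b";

        // Name on the right-hand side of an assignment, e.g. "int y = x;".
        Pattern rightHandSide = Pattern.compile("[^=]+\\s*=\\s*.*?" + n + ".*;");
        // Name somewhere between parentheses, e.g. "foo(a,b,c,x);".
        Pattern argument = Pattern.compile(".*?\\(.*?" + n + ".*?\\).*?;");

        return rightHandSide.matcher(statement).find()
                || argument.matcher(statement).find();
    }

    public static void main(String[] args) {
        System.out.println(isUsed("int y = x;", "x"));     // true
        System.out.println(isUsed("foo(a,b,c,x);", "x"));  // true
        System.out.println(isUsed("int x = 1;", "x"));     // false
    }
}
```

Other kinds of usage (for example `return x;` or `x.toString();`) are not covered by these two patterns, so a `false` result here only means "not used in one of these two ways".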

You might want to have a look at this regex tutorial by Oracle.

Upvotes: 1
