IPValverde

Reputation: 2029

How to obfuscate C++ variables and functions

I'm trying to compare algorithms to detect plagiarism. I've found many tools for plagiarism detection in plain TEXT.

But source code is very different. Say an algorithm uses a huge number of variables, functions and user-defined structures. If someone copies the source code from someone else, they will at least change the variable and function names. With a simple text comparison algorithm, these renamed functions and variables count as a "difference", so the comparison returns "false" for plagiarism.

What I want to do is "generalize" (I don't know if that's the right word) all the variable, function and user-defined structure names in a C++ source file. So the variables would be named like "a", "b", and likewise for functions: "... fa(...)", "... fb(...)". I have the C++ sources stored in string variables in PHP, ready to be compared.
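As a rough illustration of the "generalize" step described above, here is a minimal C++ sketch. It is not a real C++ parser: the function name `normalize_identifiers`, the canonical names `id0`, `id1`, ... and the partial keyword list are all assumptions made for this example. Comments, string literals and preprocessor lines get no special treatment, and library names such as `std` or `cout` would be renamed like any other identifier.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <unordered_set>

// Hypothetical helper: replaces every non-keyword identifier with a canonical
// name (id0, id1, ...) assigned in order of first appearance, so that two
// sources differing only in names normalize to the same text.
std::string normalize_identifiers(const std::string& src) {
    // Partial keyword list (assumption: extend it for real use).
    static const std::unordered_set<std::string> keywords = {
        "bool", "break", "case", "char", "class", "const", "continue",
        "default", "do", "double", "else", "enum", "float", "for", "if",
        "int", "long", "namespace", "return", "short", "signed", "sizeof",
        "static", "struct", "switch", "typedef", "unsigned", "using",
        "void", "while"};

    std::unordered_map<std::string, std::string> renamed;
    std::string out;
    std::size_t i = 0;

    while (i < src.size()) {
        const unsigned char c = static_cast<unsigned char>(src[i]);
        if (std::isalpha(c) || c == '_') {
            // Scan one identifier or keyword.
            std::size_t j = i;
            while (j < src.size() &&
                   (std::isalnum(static_cast<unsigned char>(src[j])) ||
                    src[j] == '_')) {
                ++j;
            }
            const std::string word = src.substr(i, j - i);
            if (keywords.count(word) != 0) {
                out += word;  // keywords are kept as-is
            } else {
                auto it = renamed.find(word);
                if (it == renamed.end()) {
                    // The argument is built before insertion, so the first
                    // identifier seen becomes id0.
                    it = renamed.emplace(
                        word, "id" + std::to_string(renamed.size())).first;
                }
                out += it->second;
            }
            i = j;
        } else if (std::isdigit(c)) {
            // Copy numeric literals whole so suffixes (10u, 0x1F) are not
            // mistaken for identifiers.
            std::size_t j = i;
            while (j < src.size() &&
                   (std::isalnum(static_cast<unsigned char>(src[j])) ||
                    src[j] == '.')) {
                ++j;
            }
            out.append(src, i, j - i);
            i = j;
        } else {
            out += src[i];  // punctuation and whitespace pass through
            ++i;
        }
    }
    return out;
}
```

With this sketch, `int sum(int a, int b) { return a + b; }` and `int add(int x, int y) { return x + y; }` both normalize to `int id0(int id1, int id2) { return id1 + id2; }`, so a plain text comparison on the normalized strings would treat them as identical.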

I know that many other things should be analysed for an accurate source code comparison, but this will be enough for me.

Upvotes: 3

Views: 1587

Answers (2)

Throwback1986

Reputation: 6005

I've used MOSS in the past to detect plagiarized code: http://theory.stanford.edu/~aiken/moss/. Because it compares the structure of the code rather than its raw text, it will detect the situations you presented above. The tool is language-aware, so comments are not considered in the analysis, and it goes a long way in detecting code that has been modified through simple search-and-replace of variable and/or function names.

Note: I used the tool a few years ago when I taught computer science in grad school, and it worked wonderfully in detecting code that had been yanked from the internet. Here is a well-documented account of a similar application: http://fie2012.org/sites/fie2012.org/history/fie99/papers/1110.pdf

If you google "measure software similarity", you should find a few more useful hits: http://www.ics.heacademy.ac.uk/resources/assessment/plagiarism/detectiontools_sourcecode.html

Upvotes: 0

Sideshow Bob

Reputation: 4716

It's an interesting question. Depending on how complex the algorithm is, however, it might be that variable names are what gives the plagiarism away. How many ways can you really code up a tree traversal, for example?

I think there was a paper a few years ago on identifying coders through their style, looking at all the little things like whitespace, where {}s are placed, etc. Maybe that is the way to go: look for a negative match against the student's previous style rather than a positive match against the known sources. That said, students aren't likely to have developed a very personal coding style at an early stage of learning.

One thought - what language are the examples written in? Can it be compiled? If you compile C and then do a binary comparison on the executables, will otherwise identical programs with different local variable names produce the exact same binary? (Programs with different global variable and function names wouldn't, though.)

Upvotes: 1
