nowox

Reputation: 29116

How to detect the amount of almost-repetition in a text file?

I am a programming teacher, and I would like to write a script that detects the amount of repetition in a C/C++/Python file. I guess I can treat any file as pure text.

The script's output would be the number of similar sequences that repeat. Ultimately, I am only interested in a DRY metric (how well the code satisfies the DRY principle).

Naively, I tried a simple autocorrelation, but it is hard to find a proper threshold.

import numpy as np
import matplotlib.pyplot as plt

u = open("find.c").read()
v = [ord(x) for x in u]                 # the file as a sequence of character codes
y = np.correlate(v, v, mode="same")     # autocorrelation of that signal
y = y[: len(y) // 2]                    # keep one half (the result is symmetric)

x = range(len(y))
z = np.polyval(np.polyfit(x, y, 3), x)  # cubic trend

f = (y - z)[:-5]                        # detrended signal
plt.plot(f)
plt.show()

[Figure: detrended autocorrelation of the file]

So I am looking at different strategies. I also tried to compare the similarity between each line, each group of 2 lines, each group of 3 lines, and so on:

import difflib
import numpy as np
import matplotlib.pyplot as plt

lines = open("b.txt").readlines()
lines = [line.strip() for line in lines]

n = 3
d = []
for i in range(len(lines)):
    a = lines[i:i+n]

    for j in range(len(lines)):
        b = lines[j:j+n]
        if i == j: continue  # do not compare a group with itself
        group_size = np.sum([len(x) for x in a])
        if group_size < 5: continue  # skip very short groups
        ratio = 0
        for u, v in zip(a, b):
            r = difflib.SequenceMatcher(None, u, v).ratio()
            ratio += r if r > 0.7 else 0
        d.append(ratio)

dry = sum(d) / len(lines)

In the following, we can identify some repetition at a glance:

w = max(1, len(d) // 100)                          # window of about 1% of the samples
e = np.convolve(d, np.ones(w), "valid") / w * 10   # moving average, scaled for visibility
plt.plot(range(len(d)), d, range(len(e)), e)
plt.show()

[Figure: per-group similarity d and its moving average]

Why not use:

d = np.exp(np.array(d))

[Figure: the same data after exponentiation]

Thus, the difflib module looks promising: SequenceMatcher does some magic (Levenshtein?), but it needs some magic constants as well (0.7). However, this code is worse than O(n^2) and runs very slowly for long files.
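A cheap mitigation I could try is difflib's quick_ratio(), an upper bound on ratio(), as a pre-filter so that the expensive ratio() only runs on plausible pairs. A minimal sketch (same file, same 0.7 threshold; the complexity stays quadratic, only the constant factor shrinks):

import difflib

lines = [line.strip() for line in open("b.txt")]
lines = [l for l in lines if len(l) >= 5]   # drop very short lines

scores = []
for i, u in enumerate(lines):
    for v in lines[i + 1:]:
        m = difflib.SequenceMatcher(None, u, v)
        # quick_ratio() is an upper bound on ratio() and much cheaper,
        # so the full computation only runs on pairs that could pass the threshold
        if m.quick_ratio() > 0.7:
            r = m.ratio()
            if r > 0.7:
                scores.append(r)

dry = sum(scores) / max(1, len(lines))
print(dry)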

What is funny is that the amount of repetition is quite easily spotted by an attentive eye (apologies to this student for taking his code as a good bad example):

[Figure: screenshot of the student's code, where the repeated blocks stand out]

I am sure there is a more clever solution out there.

Any hint?

Upvotes: 9

Views: 1196

Answers (5)

bob

Reputation: 434

Frame challenge: I’m not sure you should do this

It’d be a fun programming challenge for yourself, but if you intend to use it as a teaching tool, I’m not sure I would. There’s no good definition of “repeat” from the DRY principle that would be easy to test for fully in a computer program. The human definition, which I’d say is basically “failure to abstract your code at an appropriate level, manifested via some type of repetition, whether repeating exact blocks, repeating the same idea over and over, or something in between”, isn’t something I think anyone can get working well enough right now to use as a tool that teaches good DRY habits without also confusing the student or teaching bad habits. For now I’d argue this is a job for humans, because it’s easy for us and hard for computers.

That said, if you want to give it a try, first define for yourself what errors you want to catch, what they’ll look like, and what good code looks like. Then define acceptable false positive and false negative rates, and test your code on a wide variety of representative inputs, validating it against human judgement to see whether it performs well enough for your intended use. I’m guessing you’re really looking for more than simple repetition of tokens, and if you want a chance at succeeding, you need to clearly define what you’re looking for, decide how you’ll measure success, and then validate your code. A teaching tool can do great harm if it doesn’t actually teach the correct lesson: for example, if it encourages students to obfuscate their code so it doesn’t get flagged as violating DRY, if it doesn’t flag bad code so the student assumes it’s OK, or if it flags code that is actually very well written.

More specifically, what types of repetition are OK and what types aren’t? Is it good or bad to use “if” or “for” or other syntax repeatedly? Is it OK for variables and functions/methods to have names with common substrings (e.g. average_age, average_salary, etc.)? How many repetitions are OK before abstraction should happen, and when it does, what kind of abstraction is needed and at what level (a simple method, a functor, a whole other class, a whole other module)? Is more abstraction always better, or is perfect sometimes the enemy of on time and on budget? This is a really interesting problem, but it’s also a very hard problem, honestly a research problem, which is the reason for my frame challenge.

Edit: Or if you definitely want to try this anyway, you can make it a teaching tool, not necessarily as you may have intended, but by showing your students your adherence to DRY in the code you write when creating your tool, and by introducing them to the nuances of DRY and the shortcomings of automated code quality assessment, being transparent with them about the limitations of your tool. What I wouldn’t do is use it like some professors use plagiarism detection tools: as a digital oracle whose assessment of the quality of the students’ code is unquestioned. That approach is likely to cause more harm than good.

Upvotes: 1

Davis Herring

Reputation: 40013

You can treat this as a variant of the longest common subsequence problem between the input and itself, where the trivial matching of each element with itself is disallowed. This retains the optimal substructure of the standard algorithm, since it can be phrased as a non-transitive “equality” and the algorithm never relies on transitivity.

As such, we can write this trivial implementation:

import operator

class Repeat:
  def __init__(self, l):
    self.l = list(l)
    self.memo = {}

  def __call__(self, m, n):
    # longest "common subsequence" of l[:m] with l[:n], forbidding trivial self-matches
    l = self.l
    memo = self.memo
    k = m, n
    ret = memo.get(k)
    if not ret:
      if not m or not n: ret = 0, None
      elif m != n and l[m-1] == l[n-1]:  # critical change here: an element may not match itself
        z, tail = self(m-1, n-1)
        ret = z+1, ((m-1, n-1), tail)
      else: ret = max(self(m-1, n), self(m, n-1), key=operator.itemgetter(0))
      memo[k] = ret
    return ret

  def go(self):
    # unwind the linked list of matched index pairs into a plain list
    n = len(self.l)
    v = self(n, n)[1]
    ret = []
    while v:
      x, v = v
      ret.append(x)
    ret.reverse()
    return ret

def repeat(l): return Repeat(l).go()

You might want to canonicalize lines of code by removing whitespace (except perhaps between letters), stripping comments, and/or replacing each unique identifier with a standardized label. You might also want to omit trivial lines like } in C/C++ to reduce noise. Finally, by symmetry, it suffices to treat only the cases with, say, m>=n.
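As a toy illustration of that canonicalization (my own sketch, not part of the algorithm above; the regexes are deliberately crude, and the single "ID" label also swallows keywords):

import re
import sys

def canonicalize(line):
    # crude normalization of one line of C/C++/Python-like code
    line = re.sub(r"//.*|#.*", "", line)              # strip line comments
    line = re.sub(r"\b[A-Za-z_]\w*\b", "ID", line)    # map every identifier (and keyword) to one label
    return re.sub(r"\s+", "", line)                   # drop all whitespace

lines = [canonicalize(l) for l in open("find.c")]
lines = [l for l in lines if l not in ("", "{", "}", "};")]  # drop trivial lines

sys.setrecursionlimit(10 * len(lines) + 100)  # the memoized recursion above gets deep on long files
print(repeat(lines))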

Of course, there are also "real" answers and real research on this issue!

Upvotes: 2

igrinis

Reputation: 13656

I suggest the following approach: say that a repetition should be at least 3 lines long. Then we hash every group of 3 consecutive lines. If a hash repeats, we write down the line number where it occurred. All that is left is to join adjacent duplicated line numbers to get longer sequences.

For example, if you have duplicate blocks on lines 100-103 and 200-203, you will get {HASH1: (100, 200), HASH2: (101, 201)} (lines 100-102 and 200-202 produce the same value HASH1, and HASH2 covers lines 101-103 and 201-203). Joining the results produces the sequence (100, 101, 200, 201); finding the runs of consecutive numbers in it gives ((100, 101), (200, 201)).

As there are no nested comparisons over all pairs of lines, the time complexity is linear (hashing and dictionary insertion are O(1) per line, O(n) overall).

Algorithm:

  • read text line by line
  • transform it by
    • removing blanks
    • removing empty lines, saving mapping to original for the future
  • for each group of 3 consecutive transformed lines, join them and compute a hash
  • keep the positions whose hashes occur more than once (these are repetitions of at least 3 lines)
  • find the longest sequences and present the repeated text

Code:

from itertools import groupby, cycle
import re

def sequences(l):
    # group consecutive integers of the sorted list l into runs,
    # e.g. [100, 101, 200, 201] -> (100, 101), (200, 201)
    x2 = cycle(l)
    next(x2)
    grps = groupby(l, key=lambda j: j + 1 == next(x2))
    yield from (tuple(v) + (next((next(grps)[1])),) for k, v in grps if k)

with open('program.cpp') as fp:
    text = fp.readlines()

# remove white spaces
processed, text_map = [], {}
proc_ix = 0
for ix, line in enumerate(text):
    line = re.sub(r"\s+", "", line, flags=re.UNICODE)
    if line:
        processed.append(line)
        text_map[proc_ix] = ix
        proc_ix += 1

# calc hashes
hashes, hpos = [], {}
for ix in range(len(processed)-2):
    h = hash(''.join(processed[ix:ix+3])) # join 3 lines 
    hashes.append(h)
    hpos.setdefault(h, []).append(ix)  # this list will reflect lines that are duplicated

# filter duplicated three liners 
seqs = []
for k, v in hpos.items():
    if len(v) > 1:
        seqs.extend(v) 
seqs = sorted(list(set(seqs)))

# find longer sequences
result = {}
for seq in sequences(seqs):
    result.setdefault(hashes[seq[0]], []).append((text_map[seq[0]], text_map[seq[-1]+3]))

print('Duplicates found:')
for v in result.values():
    print('-'*20)
    vbeg, vend = v[0]
    print(''.join(text[vbeg:vend]))
    print(f'Found {len(v)} duplicates, lines')
    for line_numbers in v:
        print(f'{1+line_numbers[0]} : {line_numbers[1]}')

Upvotes: 0

Carlos

Reputation: 6021

I would build a system based on compressibility, because that is essentially what repetition means. Modern compression algorithms already look for repetition, so let's piggyback on that work.

  • Things that are similar will compress well under any reasonable compression algorithm, e.g. LZ. Under the hood, a compressed stream is essentially the text plus back-references into itself, which you might be able to pull out.

  • Write a program that feeds lines [0:n] into the compression algorithm and compares the compressed length with that of lines [0:n+1] (see the sketch after this list).

  • When the incremental length of the compressed output grows by much less than the incremental input, note down that you potentially have a DRY candidate at that location; and if you can figure out the format, you can see what previous text it was deemed similar to.

  • If you can figure out the compression format, you don't need to rely on the "size doesn't grow as much" heuristic; you can just pull out the references directly.

  • If needed, you can find similar structures with different names by pre-processing the input, for instance by normalizing the names. However I foresee this getting a bit messy, so it's a v2 feature. Pre-processing can also be used to normalize the formatting.
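A minimal sketch of the "compressed size doesn't grow as much" heuristic, using zlib (the file name and the 0.5 threshold are placeholders of mine; re-compressing the whole prefix each step is quadratic, and a zlib compressobj with Z_SYNC_FLUSH would make it incremental, but that is beyond a sketch):

import zlib

lines = open("find.c").read().splitlines(keepends=True)

prev = len(zlib.compress("".join(lines[:1]).encode()))
for n in range(1, len(lines)):
    cur = len(zlib.compress("".join(lines[:n + 1]).encode()))
    gain = cur - prev                  # bytes the new line adds after compression
    raw = len(lines[n].encode())       # bytes the new line adds before compression
    if raw > 10 and gain < 0.5 * raw:  # compressed output grew much less than the input
        print(f"line {n + 1}: possible repetition (+{raw}B raw, +{gain}B compressed)")
    prev = cur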

Upvotes: 4

BNazaruk

Reputation: 8111

Looks like you're choosing a long path. I wouldn't go there.

I would look into minifying the code before analyzing it, to completely remove any influence of variable names, extra spacing, formatting and even slight logic reshuffling.

Another approach would be to compare the students' byte-code. But it may not be a very good idea, since the result will likely have to be cleaned up further.

The dis module would be an interesting option.

I would, most likely, settle on comparing their ASTs. But the AST is likely to give false positives for short functions, because their structure may be too similar, so consider checking short functions with something else, something trivial.

On top of that, I would consider using Levenshtein distance or something similar to numerically compute the differences between the byte-code/sources/AST/dis output of the students. That would be what, almost O(N^2)? It shouldn't matter.
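A minimal sketch of that idea for Python sources (my own illustration: difflib's ratio stands in for a Levenshtein-style distance, and ast.dump with identifier names wiped is the normalized representation; C/C++ would need a different parser):

import ast
import difflib

def normalized_ast(source):
    # dump the AST with identifier names blanked out, so renamed variables/functions still match
    tree = ast.parse(source)
    for node in ast.walk(tree):
        for field in ("id", "name", "arg", "attr"):
            if hasattr(node, field):
                setattr(node, field, "_")
    return ast.dump(tree, include_attributes=False)

def similarity(src_a, src_b):
    # 1.0 = structurally identical, 0.0 = nothing in common
    return difflib.SequenceMatcher(None,
                                   normalized_ast(src_a),
                                   normalized_ast(src_b)).ratio()

a = open("student_a.py").read()
b = open("student_b.py").read()
print(f"structural similarity: {similarity(a, b):.2f}")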

Or, if needed, make it more complex and compute the distance between each function of student A and each function of student B, highlighting the cases where the distance is too small. It may not be needed, though.

With simplification and normalization of the input, more algorithms should start returning good results. If a student is good enough to take someone's code and reshuffle not only the variables but the logic, and maybe even improve the algorithm, then that student understands the code well enough to defend it and to use it without help in the future. I guess that's the kind of help a teacher would want students to exchange.

Upvotes: 3
