aleskva

Reputation: 1805

How to list all duplicate lines in multiple python scripts

I have many python scripts in my project and some of them contain dicts/lists/sets like:

hi.py

hi_dict = {
    'de': 'hallo',
    'en': 'hello',
}

But some files contain similar or even the same dict:

maintenance/hello.py

class hello(superclass):
    hello_dict = {
        'en': 'hello',
        'fa': 'سلام',
    }

I would like to merge same-purpose dicts into one (which could then be imported at the other occurrences). But first I need to find such dicts.

How can I search all my Python scripts (using Python or Unix terminal commands) for duplicate lines (in this case 'en': 'hello',), ignoring leading/trailing whitespace and blank lines?

I've found many answers on how to find duplicate lines in already sorted text files, but none of them explain how to handle unsorted Python scripts full of blank lines, or how to ignore leading/trailing spaces.

Note: I use git, so I can mangle the scripts however needed to get the result and then easily restore them from the last commit.

A solution to this could also find duplicate code in general, which I could then merge to reduce complexity; that would also improve the Code Climate score and could make the whole framework faster.

Upvotes: 1

Views: 70

Answers (2)

Walter A

Reputation: 20012

When the dict part always starts with a line ending in dict = {, you can select the dict definitions first (adjust the search pattern if needed).
Within such a block you are only interested in lines containing a :, so print just those.
Then strip the formatting by removing spaces and tabs (perhaps the trailing , as well); the result can be sorted (and counted with uniq -c).

sed -n '/dict = {/,/}/ s/:/:/p' inputfiles | tr -d ' \t' | sort

Result is something like

'de':'hallo',
'en':'hello',
'en':'hello',
'fa':'سلام',
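
To get the counts, add the uniq -c step mentioned above; the final awk filter, which keeps only lines occurring more than once, is my addition:

sed -n '/dict = {/,/}/ s/:/:/p' inputfiles | tr -d ' \t' | sort | uniq -c | awk '$1 > 1'

For the sample output above, only 'en':'hello', survives, with a count of 2.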

Upvotes: 0

Giacomo Alzetta

Reputation: 2479

If you want a simple shell script you can do this:

  • strip leading and trailing whitespaces from all the lines
  • sort the lines
  • compare files using comm to get common lines between the files

So you'd do:

sed 's/^\s*//;s/\s*$//' first_file.py | sort > sorted_first_file.py
sed 's/^\s*//;s/\s*$//' second_file.py | sort > sorted_second_file.py
comm -12 sorted_first_file.py sorted_second_file.py

Now to compare every pair of files in your source tree you could:

  • first run the sed + sort steps above once per file, producing sorted copies, so that work is not repeated for every pair
  • then loop over those sorted copies and, for each one, run a second find that compares it (via comm -12) to every other sorted file.

Something along the lines of:

# remove all leading/trailing spaces (this edits the .py files in
# place; -i.no_spaces keeps the originals as *.py.no_spaces backups,
# and git makes this safe anyway, per the note in the question)
find . -name '*.py' -exec sed -i.no_spaces 's/^\s*//;s/\s*$//' {} \;

# BSD/macOS sed does not understand \s; if the command above changes
# nothing, use the POSIX character class instead:
#find . -name '*.py' -exec sed -i.no_spaces 's/^[[:space:]]*//;s/[[:space:]]*$//' {} \;

# sort the stripped .py files
find . -name '*.py' -exec sort {} -o {}.sorted \;

And then the final step is the "double find":

for filename in $(find . -name '*.py.sorted');
do
    find . -name '*.py.sorted' -not -path "*$filename*" -exec comm -12 "$filename" {} \;
done

This should output, for every pair of files, the lines they have in common.
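
If you also want to see which pair of files each common line came from, a variant of the loop (same file naming as above; the sh -c wrapper is just there to print a header per comparison) could be:

for filename in $(find . -name '*.py.sorted');
do
    find . -name '*.py.sorted' -not -path "*$filename*" \
        -exec sh -c 'echo "== $1 <-> $2 =="; comm -12 "$1" "$2"' _ "$filename" {} \;
done

Note that each pair still shows up twice (A vs B and B vs A).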

Note: you probably want to remove empty lines too. You can do this with a grep before this final step.
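
For example (sed rather than grep, so the sorted files can be edited in place; a sketch):

find . -name '*.py.sorted' -exec sed -i '/^$/d' {} \;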


NOTE: if you have a relatively large number of files this will take a long time to finish: the algorithm is O(n^2), so with 1000 files it makes about 1,000,000 calls to comm -12.
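
When you are done, the generated files can be deleted, and the in-place edits to the .py files can be undone from git (per the note in the question). A sketch:

find . \( -name '*.py.no_spaces' -o -name '*.py.sorted' \) -delete
git checkout -- .   # restores all modified tracked files from the last commit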

Upvotes: 1
