aleskva

Reputation: 1805

How to list all duplicate lines in multiple python scripts

I have many python scripts in my project and some of them contain dicts/lists/sets like:

hi.py

hi_dict = {
    'de': 'hallo',
    'en': 'hello',
}

But some files contain similar or even the same dict:

maintenance/hello.py

class hello(superclass):
    hello_dict = {
        'en': 'hello',
        'fa': 'سلام',
    }

I would like to merge same-purpose dicts into one (which could then be imported at the other occurrences). But first I need to find such dicts.

How can I search all my Python scripts (using Python or Unix terminal commands) for duplicate lines (in this case 'en': 'hello',), ignoring leading/trailing whitespace and blank lines?

I've found many answers on how to find duplicate lines in already sorted text files, but none of them explain how to handle unsorted Python scripts full of blank lines, or how to ignore leading/trailing spaces.

Note: I use git, so I can mangle the scripts however needed to get the result and then easily restore them from the last commit.

A solution to this could also find duplicate code in general, which I could then merge to reduce complexity; that would also improve the Code Climate score and could make the whole framework faster.

Upvotes: 1

Views: 70

Answers (2)

Walter A

Reputation: 20012

When the dict part always starts with a line ending in dict = {, you can select the dict definitions first (adjust the search pattern if needed).
Within such a block you are only interested in lines containing a :, so print just those.
Then strip the formatting by removing spaces and tabs (perhaps the trailing , as well); the result can be sorted (and counted with uniq -c).

sed -n '/dict = {/,/}/ s/:/:/p' inputfiles | tr -d ' \t' | sort

Result is something like

'de':'hallo',
'en':'hello',
'en':'hello',
'fa':'سلام',
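
To get the counts, add the uniq -c step mentioned above; the final awk filter, which keeps only lines occurring more than once, is my addition:

sed -n '/dict = {/,/}/ s/:/:/p' inputfiles | tr -d ' \t' | sort | uniq -c | awk '$1 > 1'

For the sample output above, only 'en':'hello', survives, with a count of 2.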

Upvotes: 0

Giacomo Alzetta

Reputation: 2479

If you want a simple shell script you can do this:

  • strip leading and trailing whitespaces from all the lines
  • sort the lines
  • compare files using comm to get common lines between the files

So you'd do:

sed 's/^\s*//;s/\s*$//' first_file.py | sort > sorted_first_file.py
sed 's/^\s*//;s/\s*$//' second_file.py | sort > sorted_second_file.py
comm -12 sorted_first_file.py sorted_second_file.py

Now to compare every pair of files in your source tree you could:

  • first run the sed + sort steps above once per file, producing sorted copies, so that work is not repeated for every pair
  • then loop over those sorted copies and, for each one, run a second find that compares it (via comm -12) to every other sorted file.

Something along the lines of:

# remove all leading/trailing spaces (this edits the .py files in
# place; -i.no_spaces keeps the originals as *.py.no_spaces backups,
# and git makes this safe anyway, per the note in the question)
find . -name '*.py' -exec sed -i.no_spaces 's/^\s*//;s/\s*$//' {} \;

# BSD/macOS sed does not understand \s; if the command above changes
# nothing, use the POSIX character class instead:
#find . -name '*.py' -exec sed -i.no_spaces 's/^[[:space:]]*//;s/[[:space:]]*$//' {} \;

# sort the stripped .py files
find . -name '*.py' -exec sort {} -o {}.sorted \;

And then the final step is the "double find":

for filename in $(find . -name '*.py.sorted');
do
    find . -name '*.py.sorted' -not -path "*$filename*" -exec comm -12 "$filename" {} \;
done

This should output, for every pair of files, the lines they have in common.
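
If you also want to see which pair of files each common line came from, a variant of the loop (same file naming as above; the sh -c wrapper is just there to print a header per comparison) could be:

for filename in $(find . -name '*.py.sorted');
do
    find . -name '*.py.sorted' -not -path "*$filename*" \
        -exec sh -c 'echo "== $1 <-> $2 =="; comm -12 "$1" "$2"' _ "$filename" {} \;
done

Note that each pair still shows up twice (A vs B and B vs A).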

Note: you probably want to remove empty lines too. You can do this with a grep before this final step.
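
For example (sed rather than grep, so the sorted files can be edited in place; a sketch):

find . -name '*.py.sorted' -exec sed -i '/^$/d' {} \;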


NOTE: if you have a relatively large number of files this will take a long time to finish: the algorithm is O(n^2), so with 1000 files it makes about 1,000,000 calls to comm -12.
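
When you are done, the generated files can be deleted, and the in-place edits to the .py files can be undone from git (per the note in the question). A sketch:

find . \( -name '*.py.no_spaces' -o -name '*.py.sorted' \) -delete
git checkout -- .   # restores all modified tracked files from the last commit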

Upvotes: 1
