Reputation: 1805
I have many Python scripts in my project and some of them contain dicts/lists/sets like:
hi.py
hi_dict = {
    'de': 'hallo',
    'en': 'hello',
}
But some files contain similar or even the same dict:
maintenace/hello.py
class hello(superclass):
    hello_dict = {
        'en': 'hello',
        'fa': 'سلام',
    }
I would like to merge same-purpose dicts into one (which could then be imported from the other occurrences). But first I need to find such dicts.
How can I search all my Python scripts (using Python or Unix terminal commands) for duplicate lines (in this case 'en': 'hello',), while ignoring leading/trailing spaces and blank lines?
I've found many answers on how to find duplicate lines in already sorted text files, but none of them explained how to handle unsorted Python scripts full of blank lines, nor how to ignore leading/trailing spaces.
Note: I use git, so I can damage the scripts in any way needed to get the result and then easily restore them from the last commit.
A solution to this could also find duplicate code in general, which I could merge to reduce code complexity, so it could definitely also help improve my codeclimate score and make the whole framework faster.
Upvotes: 1
Views: 70
Reputation: 20012
When the dict part always starts with a line ending in dict = {, you can select the dict definitions first (adjust the search pattern if needed). Within that dict part you are only interested in lines containing a :, so print just those (the no-op substitution s/:/:/p in the command below does exactly that). Then remove the formatting, i.e. the spaces and tabs (perhaps also the ,). You can sort (and uniq -c) them.
sed -n '/dict = {/,/}/ s/:/:/p' inputfiles | tr -d ' \t' | sort
Result is something like
'de':'hallo',
'en':'hello',
'en':'hello',
'fa':'سلام',
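To actually count the duplicates, as mentioned above, you can append uniq -c (and optionally sort -rn to list the most frequent lines first); a minimal sketch, using the same inputfiles placeholder:
sed -n '/dict = {/,/}/ s/:/:/p' inputfiles | tr -d ' \t' | sort | uniq -c | sort -rn
For the example files this would print something like
2 'en':'hello',
1 'fa':'سلام',
1 'de':'hallo',
and every line with a count greater than 1 is a merge candidate.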
Upvotes: 0
Reputation: 2479
If you want a simple shell script you can do this:
- sed to strip leading/trailing whitespace
- sort to sort each file's lines
- comm to get common lines between the files
So you'd do:
sed 's/^\s*//;s/\s*$//' first_file.py | sort > sorted_first_file.py
sed 's/^\s*//;s/\s*$//' second_file.py | sort > sorted_second_file.py
comm -12 sorted_first_file.py sorted_second_file.py
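Applied to the two example files from the question, the shared dict entry shows up, along with noise such as the closing brace:
sed 's/^\s*//;s/\s*$//' hi.py | sort > sorted_first_file.py
sed 's/^\s*//;s/\s*$//' maintenace/hello.py | sort > sorted_second_file.py
comm -12 sorted_first_file.py sorted_second_file.py
which prints (order may vary with your locale's sort order)
'en': 'hello',
}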
Now to compare every pair of files in your source tree you could:
- run the sed + sort command above once per file and produce sorted, stripped versions of those files, to reduce the time
- use a double find command which, for every filename, runs another find command and compares that file to all other files.
Something along the lines of:
# remove all leading/trailing spaces, writing stripped copies to *.py.no_spaces
# (note: sed -i.no_spaces would keep the ORIGINAL text in the .no_spaces
# backup and strip the .py file itself, so redirect into a new file instead)
find . -name '*.py' -exec sh -c 'sed "s/^[[:space:]]*//;s/[[:space:]]*$//" "$1" > "$1.no_spaces"' _ {} \;
# sort all files
find . -name '*.py.no_spaces' -exec sort {} -o {}.sorted \;
And then the final step is the "double find":
for filename in $(find . -name '*.py.no_spaces.sorted');
do
find . -name '*.py.no_spaces.sorted' -not -path "*$filename*" -exec comm -12 "$filename" {} \;
done
This should output the common lines between all files.
Note: you probably want to remove empty lines too. You can do this with a grep before this final step.
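For instance (a sketch that rewrites each sorted file in place via a temporary file):
find . -name '*.py.no_spaces.sorted' -exec sh -c 'grep -v "^$" "$1" > "$1.tmp" && mv "$1.tmp" "$1"' _ {} \;
grep -v "^$" keeps only the non-empty lines, so blank lines can no longer show up as bogus common lines.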
NOTE: if you have a relatively large number of files this will take forever to finish, since the algorithm is O(n^2): with 1000 files it makes roughly 1,000,000 calls to comm -12.
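If that becomes a problem, one way to sidestep the quadratic blow-up (a sketch going beyond the steps above) is to concatenate all stripped, non-blank lines into a single stream and let uniq -d report every line that occurs more than once:
find . -name '*.py' -exec sed 's/^\s*//;s/\s*$//' {} + | grep -v '^$' | sort | uniq -d
That is one sort over all lines instead of a comm per file pair, though it no longer tells you which files each duplicate line comes from; replacing uniq -d with uniq -c would also show you the counts (including singletons).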
Upvotes: 1