I have a dictionary I created by reading in a whole lot of image files. It looks like this:
files = { 'file1.png': [data...], 'file2.png': [data...], ... 'file1000.png': [data...] }
I am trying to process these images to see how similar each of them is to the others. The thing is, with thousands of files' worth of data this is taking forever. I'm sure there are 20 different places I could optimize, but I am working through it one piece at a time to see where I can do better.
My original method tested file1 against all of the remaining files. Then I tested file2 against all of the files, which meant it was compared against file1 again. So by the time I get to file1000 in the above example, I shouldn't need to test anything at all, since it has already been tested 999 times.
This is what I tried:
answers = {}
for x in files:
    for y in files:
        if y not in answers or x not in answers[y]:
            if compare(files[x], files[y]) < 0.01:
                answers.setdefault(x, []).append(y)
This doesn't work, as I am getting the wrong output now. The compare function is just this:
import math, functools, operator

def compare(h1, h2):
    rms = math.sqrt(functools.reduce(operator.add, map(lambda a, b: (a - b) ** 2, h1[0], h2[0])) / len(h1[0]))
    return rms
I just didn't want to put that huge equation into the if statement.
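For what it's worth, the same RMS can be written a bit more readably with sum and zip; this is just a sketch that computes identical math to the reduce/map version:

import math

def compare(h1, h2):
    # same root-mean-square difference as the reduce/map version
    squared_diffs = ((a - b) ** 2 for a, b in zip(h1[0], h2[0]))
    return math.sqrt(sum(squared_diffs) / len(h1[0]))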
Does anyone have a good method for comparing each of the data segments of the files dictionary without overlapping the comparisons?
Edit:
After trying ShadowRanger's answer, I realized that I may not have fully understood what I needed. My original answers dictionary looked like this:
{ 'file1.png': ['file1.png', 'file23.png', 'file333.png'],
'file2.png': ['file2.png'],
'file3.png': ['file3.png', 'file4.png', 'file5.png'],
'file4.png': ['file3.png', 'file4.png', 'file5.png'],
...}
And for now I am storing my results in a file like this:
file1.png file23.png file333.png
file2.png
file3.png file4.png file5.png
file6.png
...
I thought that by using combinations and only testing each pair of files once, I would save a lot of time by not retesting files and not having to weed out duplicate answers. But as far as I can tell, the combinations have actually reduced my ability to find matches, and I'm not sure why.
You can avoid redundant comparisons with itertools.combinations, which yields each unordered pair exactly once. Just import itertools and replace your doubly nested loop:
for x in files:
    for y in files:
with a single loop that gets the combinations:
for x, y in itertools.combinations(files, 2):
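Here is a minimal sketch of the full rewritten loop, assuming the files dict and compare function from your question. Two properties of combinations explain what you saw in your edit: it never pairs a file with itself, and it yields each unordered pair only once, so to reproduce your original groups you have to seed each file's self-entry and record a match under both keys:

import itertools

# combinations(files, 2) never yields an (x, x) pair, so seed
# every file's group with itself to keep the singleton entries
answers = {x: [x] for x in files}

for x, y in itertools.combinations(files, 2):
    if compare(files[x], files[y]) < 0.01:
        # each unordered pair is visited exactly once, so record
        # the match in both directions to keep the groups symmetric
        answers[x].append(y)
        answers[y].append(x)

This still performs each of the N*(N-1)/2 distinct comparisons, but no pair is ever tested twice, and the seeded self-entries restore groups like 'file2.png' that matched nothing else.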