Reputation: 335
I have a dataset of images, which are sorted with respect to their interestingness. I want to create a diverse subset by computing the pairwise similarities between a candidate element and all current members of the subset.
My code is below. It seems to work correctly; however, as you can guess, it is very slow (it takes 6-7 hours).
I wonder if there is another way to create this subset faster than this implementation.
dataset = [a, b, c, d, e, f, g, ...]  # an example sorted dataset; real size is 20K
subset = []  # my diverse set; desired size is 200

for ind, element in enumerate(dataset):
    if ind == 0:
        # since the dataset is sorted, I can directly add the first element
        subset.append(element)
    else:
        similarities = []
        for member in subset:
            similarities.append(compute_similarity(element, member))
        if not any(similarities):  # the candidate is not similar to any existing member
            subset.append(element)
    if len(subset) >= 200:
        break
Upvotes: 2
Views: 178
Reputation: 427
The main improvement I see is to break out of the inner loop as soon as you find that the current candidate is similar to some element already in your set. This also allows for a simplification of your code, even eliminating the special case for the first element.
dataset = [a, b, c, d, e, f, g, ...]  # an example sorted dataset; real size is 20K
subset = []  # my diverse set; desired size is 200

for candidate in dataset:
    for member in subset:
        if compute_similarity(candidate, member):
            break
    else:
        subset.append(candidate)
    if len(subset) >= 200:
        break
Note the for-else construct, which only executes the "else" branch if the for-loop did not break.
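As a quick, self-contained illustration of for-else (the function and inputs here are made up for the demo, not part of your problem):

```python
def has_even(nums):
    """Return True if nums contains an even number."""
    for n in nums:
        if n % 2 == 0:
            break  # found one; skip the else branch
    else:
        # runs only when the loop finished without hitting break
        return False
    return True

print(has_even([1, 3, 4]))  # True  (loop breaks at 4)
print(has_even([1, 3, 5]))  # False (loop completes, else runs)
```

In your code the same pattern means: append the candidate only if no existing member triggered the break.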
Any other optimizations would depend on your actual similarity metric. Perhaps you can sort or index the data by that metric, reducing the number of comparisons you need to do.
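For example, if your images are represented as feature vectors and compute_similarity is something like a thresholded cosine similarity (an assumption; substitute your own metric), you can compare a candidate against all current members in one vectorized NumPy operation instead of a Python-level inner loop. A minimal sketch under that assumption, with hypothetical names (`diverse_subset`, `features`, `threshold`):

```python
import numpy as np

def diverse_subset(features, threshold=0.9, k=200):
    """Greedy diverse selection over the rows of `features`, which are
    assumed to be sorted by interestingness. Two items count as "similar"
    when their cosine similarity exceeds `threshold` -- an assumed
    stand-in for the original compute_similarity()."""
    # Normalize rows so that a plain dot product equals cosine similarity.
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.clip(norms, 1e-12, None)

    chosen = [0]  # the first (most interesting) item is always kept
    for i in range(1, len(unit)):
        # One matrix-vector product compares the candidate against every
        # current member at once (BLAS) instead of looping in Python.
        sims = unit[chosen] @ unit[i]
        if np.all(sims <= threshold):
            chosen.append(i)
            if len(chosen) >= k:
                break
    return chosen
```

With 20K items and a 200-element target, the per-candidate work is a single (at most) 200-by-d matrix-vector product, which is typically far faster than 200 separate Python calls.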
Upvotes: 1