Reputation: 335
I have a dataset of images, which are sorted with respect to their interestingness. I want to create a diverse subset by computing the pairwise similarities between a candidate element and all current members of the subset.
My code is below. It seems to work correctly; however, as you can guess, it is very slow (it takes 6-7 hours).
I wonder if there is another way to create this subset faster than this implementation.
dataset = [a, b, c, d, e, f, g, ...]  # an example sorted dataset; real size is 20K
subset = []  # my diverse set; desired size is 200

for ind, element in enumerate(dataset):
    if ind == 0:
        # since the dataset is sorted, I can directly add the first element
        subset.append(element)
    else:
        similarities = []
        for member in subset:
            similarities.append(compute_similarity(element, member))
        if not any(similarities):  # the candidate is not similar to any existing member
            subset.append(element)
    if len(subset) >= 200:
        break
Upvotes: 2
Views: 178
Reputation: 427
The main improvement I see is to break out of the inner loop as soon as you find that the current candidate is similar to some element already in your set. This also allows for a simplification of your code, even eliminating the special case for the first element.
dataset = [a, b, c, d, e, f, g, ...]  # an example sorted dataset; real size is 20K
subset = []  # my diverse set; desired size is 200

for candidate in dataset:
    for member in subset:
        if compute_similarity(candidate, member):
            break
    else:
        subset.append(candidate)
    if len(subset) >= 200:
        break
Note the for-else construct, which only executes the "else" branch if the for-loop did not break.
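As a quick, self-contained illustration of for-else (the function and inputs here are made up for the demo, not part of your problem):

```python
def has_even(nums):
    """Return True if nums contains an even number."""
    for n in nums:
        if n % 2 == 0:
            break  # found one; skip the else branch
    else:
        # runs only when the loop finished without hitting break
        return False
    return True

print(has_even([1, 3, 4]))  # True  (loop breaks at 4)
print(has_even([1, 3, 5]))  # False (loop completes, else runs)
```

In your code the same pattern means: append the candidate only if no existing member triggered the break.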
Any other optimizations would depend on your actual similarity metric. Perhaps you can sort or index the data by that metric, reducing the number of comparisons you need to do.
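For example, if your images are represented as feature vectors and compute_similarity is something like a thresholded cosine similarity (an assumption; substitute your own metric), you can compare a candidate against all current members in one vectorized NumPy operation instead of a Python-level inner loop. A minimal sketch under that assumption, with hypothetical names (`diverse_subset`, `features`, `threshold`):

```python
import numpy as np

def diverse_subset(features, threshold=0.9, k=200):
    """Greedy diverse selection over the rows of `features`, which are
    assumed to be sorted by interestingness. Two items count as "similar"
    when their cosine similarity exceeds `threshold` -- an assumed
    stand-in for the original compute_similarity()."""
    # Normalize rows so that a plain dot product equals cosine similarity.
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.clip(norms, 1e-12, None)

    chosen = [0]  # the first (most interesting) item is always kept
    for i in range(1, len(unit)):
        # One matrix-vector product compares the candidate against every
        # current member at once (BLAS) instead of looping in Python.
        sims = unit[chosen] @ unit[i]
        if np.all(sims <= threshold):
            chosen.append(i)
            if len(chosen) >= k:
                break
    return chosen
```

With 20K items and a 200-element target, the per-candidate work is a single (at most) 200-by-d matrix-vector product, which is typically far faster than 200 separate Python calls.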
Upvotes: 1