retrieve closest element from a set of elements

Question

I'm experimenting with an idea, where I have following subproblem:

I have a list of size m containing tuples of fixed length n.

[(e11, e12, .., e1n), (e21, e22, .., e2n), ..., (em1, em2, .., emn)]

Now, given some random tuple (t1, t2, .., tn), which does not belong to the list, I want to find the closest tuple(s), that belongs to the list.

I use the following distance function (Hamming distance):

def distance(A, B):
    total = 0
    for e1, e2 in zip(A, B):
        total += e1 == e2
    return total

One option is to use exhaustive search, but this is not sufficient for my problem as the lists are quite large. Other idea, I have come up with, is to first use kmedoids to cluster the list and retrieve K medoids (cluster centers). For querying, I can determine the closest cluster with K calls to distance function. Then, I can search for the closest tuple from that particular cluster. I think it should work, but I am not completely sure, if it is fine in cases the query tuple is on the edges of the clusters.

However, I was wondering, if you have a better idea to solve the problem as my mind is completely blank at the moment. However, I have a strong feeling that there may be a clever way to do it.

Solutions that require precomputing something are fine as long as they bring down the complexity of the query.

amit · Accepted Answer

You can store a hash table (dictionary/map) that maps from an element (in the tupple) to the tupples it appears in: hash:element->list.

Now, when you have a new "query", you will need to iterate each of hash(element) for each element of the new "query", and find the maximal number of hits.

pseudo code:

findMax(tuple):
  histogram <- empty map  
  for each element in tuple:
     #assuming hash_table is the described DS from above
     for each x in hash_table[element]: 
         histogram[x]++ #assuming lazy initialization to 0
  return key with highest value in histogram

An alternative, that does not exactly follow the metric you desired is a k-d tree. The difference is k-d tree also take into consideration the "distance" between the elements (and not only equality/inequality).
k-d trees also require the elements to be comparable.

retrieve closest element from a set of elements

Answers (2)

Related Questions