Timo
Timo

Reputation: 5390

retrieve closest element from a set of elements

I'm experimenting with an idea, where I have following subproblem:

I have a list of size m containing tuples of fixed length n.

[(e11, e12, .., e1n), (e21, e22, .., e2n), ..., (em1, em2, .., emn)]

Now, given some random tuple (t1, t2, .., tn), which does not belong to the list, I want to find the closest tuple(s), that belongs to the list.

I use the following distance function (Hamming distance):

def distance(A, B):
    total = 0
    for e1, e2 in zip(A, B):
        total += e1 == e2
    return total

One option is to use exhaustive search, but this is not sufficient for my problem as the lists are quite large. Other idea, I have come up with, is to first use kmedoids to cluster the list and retrieve K medoids (cluster centers). For querying, I can determine the closest cluster with K calls to distance function. Then, I can search for the closest tuple from that particular cluster. I think it should work, but I am not completely sure, if it is fine in cases the query tuple is on the edges of the clusters.

However, I was wondering, if you have a better idea to solve the problem as my mind is completely blank at the moment. However, I have a strong feeling that there may be a clever way to do it.

Solutions that require precomputing something are fine as long as they bring down the complexity of the query.

Upvotes: 2

Views: 132

Answers (2)

Sheol
Sheol

Reputation: 164

If your data is big enough, you may want to create some inverted indexes over it.

With a data of m vectors of n elements.

Data:

0: 1, 2, 3, 4, 5, ...
1: 2, 3, 1, 5, 3, ...
2: 5, 3, 2, 1, 3, ...
3: 1, 2, 1, 5, 3, ...
...
m: m0, ... mn

Then you want to get n indexes like this:

Index0

1: 0, 3
2: 1
5: 2

Index1

2: 0, 3
3: 3, 3

Index2

3: 0
1: 1, 3
2: 2

...

Then you only search on your indexes to get the tuples that contain any of the query tuple values and find the closest tuple within those.

def search(query)
  candidates = []
  for i in range(len(query))
    value = query[i]
    candidates.append(indexes[i][value])

  # find candidates with min distance
  for candidate in candidates
    distance = distance(candidate, query)
    ...  

The heavy process is creating the indexes, once you built them the search will be really fast.

Upvotes: 1

amit
amit

Reputation: 178521

You can store a hash table (dictionary/map) that maps from an element (in the tupple) to the tupples it appears in: hash:element->list<tupple>.

Now, when you have a new "query", you will need to iterate each of hash(element) for each element of the new "query", and find the maximal number of hits.

pseudo code:

findMax(tuple):
  histogram <- empty map  
  for each element in tuple:
     #assuming hash_table is the described DS from above
     for each x in hash_table[element]: 
         histogram[x]++ #assuming lazy initialization to 0
  return key with highest value in histogram

An alternative, that does not exactly follow the metric you desired is a k-d tree. The difference is k-d tree also take into consideration the "distance" between the elements (and not only equality/inequality).
k-d trees also require the elements to be comparable.

Upvotes: 3

Related Questions