Reputation: 5390
I'm experimenting with an idea, where I have following subproblem:
I have a list of size m
containing tuples of fixed length n
.
[(e11, e12, .., e1n), (e21, e22, .., e2n), ..., (em1, em2, .., emn)]
Now, given some random tuple (t1, t2, .., tn)
, which does not belong to the list, I want to find the closest tuple(s), that belongs to the list.
I use the following distance function (Hamming distance):
def distance(A, B):
total = 0
for e1, e2 in zip(A, B):
total += e1 == e2
return total
One option is to use exhaustive search, but this is not sufficient for my problem as the lists are quite large. Other idea, I have come up with, is to first use kmedoids
to cluster the list and retrieve K
medoids (cluster centers). For querying, I can determine the closest cluster with K
calls to distance function. Then, I can search for the closest tuple from that particular cluster. I think it should work, but I am not completely sure, if it is fine in cases the query tuple is on the edges of the clusters.
However, I was wondering, if you have a better idea to solve the problem as my mind is completely blank at the moment. However, I have a strong feeling that there may be a clever way to do it.
Solutions that require precomputing something are fine as long as they bring down the complexity of the query.
Upvotes: 2
Views: 132
Reputation: 164
If your data is big enough, you may want to create some inverted indexes over it.
With a data of m vectors of n elements.
Data:
0: 1, 2, 3, 4, 5, ...
1: 2, 3, 1, 5, 3, ...
2: 5, 3, 2, 1, 3, ...
3: 1, 2, 1, 5, 3, ...
...
m: m0, ... mn
Then you want to get n indexes like this:
Index0
1: 0, 3
2: 1
5: 2
Index1
2: 0, 3
3: 3, 3
Index2
3: 0
1: 1, 3
2: 2
...
Then you only search on your indexes to get the tuples that contain any of the query tuple values and find the closest tuple within those.
def search(query)
candidates = []
for i in range(len(query))
value = query[i]
candidates.append(indexes[i][value])
# find candidates with min distance
for candidate in candidates
distance = distance(candidate, query)
...
The heavy process is creating the indexes, once you built them the search will be really fast.
Upvotes: 1
Reputation: 178521
You can store a hash table (dictionary/map) that maps from an element (in the tupple) to the tupples it appears in: hash:element->list<tupple>
.
Now, when you have a new "query", you will need to iterate each of hash(element)
for each element of the new "query", and find the maximal number of hits.
pseudo code:
findMax(tuple):
histogram <- empty map
for each element in tuple:
#assuming hash_table is the described DS from above
for each x in hash_table[element]:
histogram[x]++ #assuming lazy initialization to 0
return key with highest value in histogram
An alternative, that does not exactly follow the metric you desired is a k-d tree. The difference is k-d tree also take into consideration the "distance" between the elements (and not only equality/inequality).
k-d trees also require the elements to be comparable.
Upvotes: 3