Reputation: 1301
So here is the scenario. I have an unsorted array (very large) called gallery which contains pairs of templates (std::vector<uint8_t>
) and their associated IDs (std::string
).
I have a function in which I am provided with a template and must return the IDs of the k
most similar templates in my gallery (I am using cosine similarity to generate a similarity score between the templates).
I considered using a heap as discussed in this post.
However, the issue is that gallery can contain multiple different templates which belong to a single ID. In my function, I must return k
unique IDs.
For context, I am doing a facial recognition application. I can have multiple different templates belonging to a single person in my gallery (the person was enrolled in the gallery using multiple different images, hence multiple templates belonging to their ID). The search function should return the k
most similar people to the provided template (and thus not return the same ID more than once).
Would appreciate an efficient algorithm for doing so in C++.
Edit: Code snipped for my proposed solution with heap (does not deal with duplicates properly)
std::priority_queue<std::pair<double, std::string>, std::vector<std::pair<double, std::string> >, std::greater<> > queue;
for(const auto& templPair : m_gallery) {
try{
double similairty = computeSimilarityScore(templPair.templ, idTemplateDeserial);
if (queue.size() < candidateListLength) {
queue.push(std::pair<double, std::string>(similairty, templPair.id));
} else if (queue.top().first < similairty) {
queue.pop();
queue.push(std::pair<double, std::string>(similairty, templPair.id));
}
} catch(...) {
std::cout << "Unable to compute similarity\n";
continue;
}
}
// CandidateListLength number of IDs with the highest scores will be in queue
Here is an example to hopefully help. For the sake of simplicity, I will assume that the similarity score has already been computed for the templates.
Template 1: similarity score: 0.4, ID: Cyrus
Template 2: similarity score: 0.5, ID: James
Template 3: similarity score: 0.9, ID: Bob
Template 4: similarity score: 0.8, ID: Cyrus
Template 5: similarity score: 0.7, ID: Vanessa
Template 6: similarity score: 0.3, ID: Ariana
Getting the IDs of the top 3 scoring templates will return [Bob, Cyrus, Vanessa]
Upvotes: 0
Views: 164
Reputation: 1301
Implemented the answer outline by Maras. It seems to do the job.
#include <iostream>
#include <vector>
#include <map>
#include <utility>
#include <string>
#include <set>
int main() {
int K = 3;
std::vector<std::pair<double, std::string>> data {
{0.4, "Cyrus"},
{0.5, "James"},
{0.9, "Bob"},
{0.8, "Cyrus"},
{0.7, "Vanessa"},
{0.3, "Ariana"},
};
std::set<std::pair<double, std::string>> mySet;
std::map<std::string, double> myMap;
for (const auto& pair: data) {
if (myMap.find( pair.second ) == myMap.end()) {
// The ID is unique
if (mySet.size() < K) {
// The size of the set is less than the size of search candidates
// Add the result to the map and the set
mySet.insert(pair);
myMap[pair.second] = pair.first;
} else {
// Check to see if the current score is larger than the worst performer in the set
auto worstPairPtr = mySet.begin();
if (pair.first > (*worstPairPtr).first) {
// The contender performed better than the worst in the set
// Remove the worst item from the set, and add the contender
// Remove the corresponding item from the map, and add the new contender
mySet.erase(worstPairPtr);
myMap.erase((*worstPairPtr).second);
mySet.insert(pair);
myMap[pair.second] = pair.first;
}
}
} else {
// The ID already exists
// Compare the contender score to the score of the existing ID.
// If the contender score is better, replace the existing item score with the new score
// Remove the old item from the set
if (pair.first > myMap[pair.second]) {
mySet.erase({myMap[pair.second], pair.second});
mySet.insert(pair);
myMap[pair.second] = pair.first;
}
}
}
for (auto it = mySet.rbegin(); it != mySet.rend(); ++it) {
std::cout << (*it).second << std::endl;
}
}
The output is
Bob
Cyrus
Vanessa
Upvotes: 1
Reputation: 982
Use std::set structure (balanced BST) instead of heap. It also puts elements in order, lets you find the largest and the smallest element inserted. In addition, it automatically detects a duplicate when using insert function and ignores it, so each element inside will always be unique. Complexity is exactly the same (it is a bit slower though because of a larger constant).
Edit: I probably did not understand the question properly. As far as I can see you can have multiple elements with different values which should be considered a duplicate.
What I would do:
Make a map where key is an ID and value is a template value of a template currently in the set.
If you want to add a new template:
Upvotes: 2