very hit

Reputation: 326

Algorithm to find top K elements with unique label

I have a custom struct data:

struct mydata
{
    double distance;
    string label;
};

I generate a lot of mydata items in a loop, and I want to keep the K items with the smallest distance, with the constraint that their labels must be unique. I am currently using a max heap to solve this problem. My algorithm looks like this:

// get topK items with unique label
for i = 1:N
{
    mydata item = generate_a_data();
    if (max_heap.size() < K)
    {
        insert_to_max_heap(item);
    }
    else // max_heap is full
    {
        if (item.distance < max_heap.top.distance)
        {
            insert_to_max_heap(item);
        }
    }
}

The problem is in insert_to_max_heap(). Because of the unique-label constraint, I cannot simply replace the top node of the max heap with the new item; I first have to iterate over all elements in the heap to check whether the same label already exists. If a node with the same label exists, I just update that node's distance. Pseudocode:

insert_to_max_heap(item)
{
    for_each node in max_heap
    {
        if (node.label == item.label)
        {
            if (node.distance > item.distance)
            {
                // update min distance
                node.distance = item.distance;
            }
            return;
        }
    }
    // no identical label, replace the top node
    max_heap.top = item;
    sort_max_heap();
}

Is there a more efficient way to do this, or a different approach to the problem? The algorithm should be as fast as possible, and I don't have enough space to store all the items generated in the loop.

Upvotes: 2

Views: 404

Answers (1)

kensou97

Reputation: 31

I think you need to maintain a hash map in which the key is the label and the value is the position (or a pointer) of the struct in your max heap.

When a new mydata is generated, first check whether a struct with the same label exists in the hash map. If it does, decide based on the distance whether to replace it (after replacing, sift it down in the heap if necessary). Otherwise, decide whether to insert the new mydata into your heap. Either way, don't forget to update your hash map at the same time.
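A minimal C++ sketch of this idea, assuming the heap is stored in a vector and the hash map stores each label's index in that vector. The names UniqueTopK, offer and items are made up for illustration and are not from the question; the heap is hand-rolled because std::priority_queue does not expose element positions.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

struct mydata {
    double distance;
    std::string label;
};

// Hypothetical helper: keeps the K smallest-distance items, one per label.
class UniqueTopK {
public:
    explicit UniqueTopK(std::size_t k) : k_(k) { assert(k > 0); }

    void offer(const mydata& item) {
        auto it = pos_.find(item.label);
        if (it != pos_.end()) {
            // Label already kept: only a smaller distance changes anything.
            std::size_t i = it->second;
            if (item.distance < heap_[i].distance) {
                heap_[i].distance = item.distance;
                sift_down(i);  // key decreased, so in a max heap it can only move down
            }
        } else if (heap_.size() < k_) {
            // Heap not full yet: append and restore heap order.
            heap_.push_back(item);
            pos_[item.label] = heap_.size() - 1;
            sift_up(heap_.size() - 1);
        } else if (item.distance < heap_[0].distance) {
            // New label beats the current worst item: replace the root.
            pos_.erase(heap_[0].label);
            heap_[0] = item;
            pos_[item.label] = 0;
            sift_down(0);
        }
    }

    // Kept items in heap order (index 0 holds the largest kept distance).
    const std::vector<mydata>& items() const { return heap_; }

private:
    // Swap two heap nodes and keep the label -> index map in sync.
    void swap_nodes(std::size_t a, std::size_t b) {
        std::swap(heap_[a], heap_[b]);
        pos_[heap_[a].label] = a;
        pos_[heap_[b].label] = b;
    }

    void sift_up(std::size_t i) {
        while (i > 0) {
            std::size_t parent = (i - 1) / 2;
            if (heap_[parent].distance >= heap_[i].distance) break;
            swap_nodes(parent, i);
            i = parent;
        }
    }

    void sift_down(std::size_t i) {
        for (;;) {
            std::size_t largest = i, l = 2 * i + 1, r = l + 1;
            if (l < heap_.size() && heap_[l].distance > heap_[largest].distance) largest = l;
            if (r < heap_.size() && heap_[r].distance > heap_[largest].distance) largest = r;
            if (largest == i) break;
            swap_nodes(i, largest);
            i = largest;
        }
    }

    std::size_t k_;
    std::vector<mydata> heap_;
    std::unordered_map<std::string, std::size_t> pos_;
};
```

With this layout, each generated item costs one expected O(1) hash lookup plus at most O(log K) for sifting, instead of an O(K) scan of the heap, and memory stays at O(K). In the loop from the question, every generated item would simply be passed to offer().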

Upvotes: 2
