kingchris
kingchris

Reputation: 1757

Recommend C++ container to hold top 20 minimum values

In SQL there is the feature to say something like

SELECT TOP 20 distance FROM dbFile ORDER BY distance ASC

If my SQL is correct with, say 10,000 records, this should return the 20 smallest distances in my databse.

I don't have a database. I have a 100,000-element simple array.

Is there a C++ container, Boost, MFC or STL that provides simple code for a struct like

struct closest{
    int    ID;
    double distance;
    closest():ID(-1), distance(std::numeric_limits<double>::max( )){}
};

Where I can build a sorted by distance container like

boost::container::XXXX<closest> top(20);

And then have a simple

top.replace_if(closest(ID,Distance));

Where the container will replace the entry with the current highest distance in my container with my new entry if it is less than the current highest distance in my container.

I am not worried about speed. I like elegant clean solutions where containers and code do all the heavy lifting.

EDIT. Addendum after all the great answers received.

What I really would of liked to have found, due to its elegance. Is a sorted container that I could create with a container size limit. In my case 20. Then I could push or insert to my hearts content a 100 000 items or more. But. There is always a but. The container would of maintained the max size of 20 by replacing or not inserting an item if its comparator value was not within the lowest 20 values.

Yes. I know now from all these answers that via programming and tweaking existing containers the same effect can be achieved. Perhaps when the next round of suggestions for the C & C++ standards committee sits. We could suggest. Self sorting (which we kind of have already) and self size limiting containers.

Upvotes: 20

Views: 3775

Answers (9)

moooeeeep
moooeeeep

Reputation: 32542

I actually have 100 000 Lat & Lon points drawn on a opengl sphere. I want to work out the 20 nearest points to each of the 100 000 points. So we have two loops to pick each point then calculate that point against every other point and save the closest 20 points.

This reads as if you want to perform a k-nearest neighbor search in the first place. For this, you usually use specialized data structures (e.g., a binary search tree) to speed up the queries (especially when you are doing 100k of them).

For spherical coordinates you'd have to do a conversion to a cartesian space to fix the coordinate wrap-around. Then you'd use an Octree or kD-Tree.

Here's an approach using the Fast Library for Approximate Nearest Neighbors (FLANN):

#include <vector>
#include <random>
#include <iostream>
#include <flann/flann.hpp>
#include <cmath>

struct Point3d {
    float x, y, z;
    void setLatLon(float lat_deg, float lon_deg) {
        static const float r = 6371.; // sphere radius
        float lat(lat_deg*M_PI/180.), lon(lon_deg*M_PI/180.);
        x = r * std::cos(lat) * std::cos(lon);
        y = r * std::cos(lat) * std::sin(lon);
        z = r * std::sin(lat);
    }
};

std::vector<Point3d> random_data(std::vector<Point3d>::size_type n) {
    static std::mt19937 gen{std::random_device()()};
    std::uniform_int_distribution<> dist(0, 36000);
    std::vector<Point3d> out(n);
    for (auto &i: out)
        i.setLatLon(dist(gen)/100., dist(gen)/100.);
    return out;
}

int main() {
    // generate random spherical point cloud
    std::vector<Point3d> data = random_data(1000);
    // generate query point(s) on sphere
    std::vector<Point3d> query = random_data(1);

    // convert into library datastructures
    auto mat_data = flann::Matrix<float>(&data[0].x, data.size(), 3);
    auto mat_query = flann::Matrix<float>(&query[0].x, query.size(), 3);
    // build KD-Tree-based index data structure
    flann::Index<flann::L2<float> > index(mat_data, flann::KDTreeIndexParams(4));
    index.buildIndex();
    // perform query: approximate nearest neighbor search
    int k = 5; // number of neighbors to find
    std::vector<std::vector<int>> k_indices;
    std::vector<std::vector<float>> k_dists;
    index.knnSearch(mat_query, k_indices, k_dists, k, flann::SearchParams(128));

    // k_indices now contains for each query point the indices to the neighbors in the original point cloud
    // k_dists now contains for each query point the distances to those neighbors
    // removed printing of results for brevity
}

You'd receive results similar to this one (click to enlarge):

flann result

For reference:

Upvotes: 4

ForeverStudent
ForeverStudent

Reputation: 2537

What you need is to have a maxheap of size 20. Recall that the root of your heap will be the largest value in the heap.

This heap will contain the records with smallest distance that you have encountered so far. For the first 20 out of 10000 values you just push to the heap.

At this point you iterate through the rest of the records and for each record, you compare it to the root of your heap.

Remember that the root of your heap is basically the very worst of the very best.(The record with the largest distance, among the 20 records with the shortest distance you have encountered so far).

If the value you are considering is not worth keeping (its distance is larger that the root of your tree), ignore that record and just keep moving.

Otherwise you pop your heap (get rid of the root) and push the new value in. The priority queue will automatically put its record with the largest distance on the root again.

Once you keep doing this over the entire set of 10000 values, you will be left with the 20 records that have the smallest distance, which is what you want.

Each push-pop takes constant O(1) time, iterating through all inputs of N is O(n) so this is a Linear solution.

Edit: I thought it would be useful to show my idea in C++ code. This is a toy example, you can write a generic version with templates but I chose to keep it simple and minimalistic:

#include <iostream>
#include <queue>
using namespace std;
class smallestElements
{
private:
    priority_queue<int,std::vector<int>,std::less<int> > pq;
    int maxSize;
public:
    smallestElements(int size): maxSize(size)
    {
    pq=priority_queue<int, std::vector<int>, std::less<int> >();
    }
    void possiblyAdd(int newValue)
    {
        if(pq.size()<maxSize)
        {
            pq.push(newValue);
            return;
        }
        if(newValue < pq.top())
        {
            pq.pop(); //get rid of the root
            pq.push(newValue); //priority queue will automatically restructure
        }
    }
    void printAllValues()
    {
        priority_queue<int,std::vector<int>,std::less<int> > cp=pq;
        while(cp.size()!=0)
        {
            cout<<cp.top()<<" ";
            cp.pop();
        }
        cout<<endl;
    }
};

How you use this is really straight forward. basically in your main function somewhere you will have:

smallestElements se(20); //we want 20 smallest
//...get your stream of values from wherever you want, call the int x
se.possiblyAdd(x); //no need for bounds checking or anything fancy
//...keep looping or potentially adding until the end

se.printAllValues();//shows all the values in your container of smallest values
// alternatively you can write a function to return all values if you want

Upvotes: 22

moooeeeep
moooeeeep

Reputation: 32542

I have posted a number of approaches to the similar problem of retrieving the top 5 minimum values recently here:

There are implementations that keep a specific number of smallest or greatest items from an input vector in different ways. The nth_element algorithm performs a partial sort, the priority queue maintains a heap, the set a binary search tree, and the deque- and vector-based approaches just remove an element based on a (linear) min/max search.

It should be fairly easy to implement a custom comparison operator and to adapt the number of items to keep n.

Here's the code (refactored based off the other post):

#include <algorithm>
#include <functional>
#include <queue>
#include <set>
#include <vector>
#include <random>
#include <iostream>
#include <chrono>

template <typename T, typename Compare = std::less<T>>
std::vector<T> filter_nth_element(std::vector<T> v, typename std::vector<T>::size_type n) {
    auto target = v.begin()+n;
    std::nth_element(v.begin(), target, v.end(), Compare());
    std::vector<T> result(v.begin(), target);
    return result;
}

template <typename T, typename Compare = std::less<T>>
std::vector<T> filter_pqueue(std::vector<T> v, typename std::vector<T>::size_type n) {
    std::vector<T> result;
    std::priority_queue<T, std::vector<T>, Compare> q;
    for (auto i: v) {
        q.push(i);
        if (q.size() > n) {
            q.pop();
        }
    }
    while (!q.empty()) {
        result.push_back(q.top());
        q.pop();
    }
    return result;
}

template <typename T, typename Compare = std::less<T>>
std::vector<T> filter_set(std::vector<T> v, typename std::vector<T>::size_type n) {
    std::set<T, Compare> s;
    for (auto i: v) {
        s.insert(i);
        if (s.size() > n) {
            s.erase(std::prev(s.end()));
        }
    }
    return std::vector<T>(s.begin(), s.end());
}

template <typename T, typename Compare = std::less<T>>
std::vector<T> filter_deque(std::vector<T> v, typename std::vector<T>::size_type n) {
    std::deque<T> q;
    for (auto i: v) {
        q.push_back(i);
        if (q.size() > n) {
            q.erase(std::max_element(q.begin(), q.end(), Compare()));
        }
    }
    return std::vector<T>(q.begin(), q.end());
}

template <typename T, typename Compare = std::less<T>>
std::vector<T> filter_vector(std::vector<T> v, typename std::vector<T>::size_type n) {
    std::vector<T> q;
    for (auto i: v) {
        q.push_back(i);
        if (q.size() > n) {
            q.erase(std::max_element(q.begin(), q.end(), Compare()));
        }
    }
    return q;
}

template <typename Clock = std::chrono::high_resolution_clock>
struct stopclock {
    std::chrono::time_point<Clock> start;
    stopclock() : start(Clock::now()) {}
    ~stopclock() {
        auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(Clock::now() - start);
        std::cout << "elapsed: " << elapsed.count() << "ms\n";
    }
};

std::vector<int> random_data(std::vector<int>::size_type n) {
    std::mt19937 gen{std::random_device()()};
    std::uniform_int_distribution<> dist;
    std::vector<int> out(n);
    for (auto &i: out)
        i = dist(gen);
    return out;
}

int main() {
    std::vector<int> data = random_data(1000000);
    stopclock<> sc;
    std::vector<int> result = filter_nth_element(data, 5);
    std::cout << "smallest values: ";
    for (auto i : result) {
        std::cout << i << " ";
    }
    std::cout << "\n";
    std::cout << "largest values: ";
    result = filter_nth_element<int, std::greater<int>>(data, 5);
    for (auto i : result) {
        std::cout << i << " ";
    }
    std::cout << "\n";
}

Example output is:

$ g++ test.cc -std=c++11 && ./a.out
smallest values: 4433 2793 2444 4542 5557 
largest values: 2147474453 2147475243 2147477379 2147469788 2147468894 
elapsed: 123ms

Note that in this case only the position of the nth element is accurate with respect to the order imposed by the provided comparison operator. The other elements are guaranteed to be smaller/greater or equal to that one however, depending on the comparison operator provided. That is, the top n min/max elements are returned, but they are not correctly sorted.

Don't expect the other algorithms to produce results in a specific order either. (While the approaches using priority queue and set actually produce sorted output, their results have the opposite order).

For reference:

Upvotes: 4

graham.reeds
graham.reeds

Reputation: 16476

I would use nth_element like @juanchopanza suggested before he deleted it.

His code looked like:

bool comp(const closest& lhs, const closest& rhs)
{
    return lhs.distance < rhs.distance;
}

then

std::vector<closest> v = ....;
nth_element(v.begin(), v.begin() + 20, v.end(), comp);

Though if it was only ever going to be twenty elements then I would use a std::array.

Upvotes: 5

Dmitry Rubanovich
Dmitry Rubanovich

Reputation: 2627

Heap is the data structure that you need. pre-C++11 stl only had functions which managed heap data in your own arrays. Someone mentioned that boost has a heap class, but you don't need to go so far as to use boost if your data is simple integers. stl's heap will do just fine. And, of course, the algorithm is to order the heap so that the highest value is the first one. So with each new value, you push it on the heap, and, once the heap reaches 21 elements in size, you pop the first value from the heap. This way whatever 20 values remain are always the 20 lowest.

Upvotes: 2

MikeMB
MikeMB

Reputation: 21166

If this is about filtering the 20 smallest elements from a stream on the fly, then a solution based on std::priority_queue (or std::multiset) is the way to go.

However, if it is about finding the 20 smallest elements in a given arraym I wouldn't go for a special container at all, but simply the algorithm std::nth_element - a partial sorting algorithm that will give you the n smallest elements - EDIT: or std::partial_sort (thanks Jarod42) if the elements also have to be sorted. It has linear complexity and it's just a single line to write (+ the comparison operator, which you need in any case):

#include <vector>
#include <iostream>
#include <algorithm>

struct Entry {
    int ID;
    double distance;    
};

std::vector<Entry> data;    

int main() {
    //fill data;

    std::nth_element(data.begin(), data.begin() + 19, data.end(), 
        [](auto& l, auto& r) {return l.distance < r.distance; });

    std::cout << "20 elements with smallest distance: \n"; 
    for (size_t i = 0; i < 20; ++i) {
        std::cout << data[i].ID << ":" << data[i].distance << "\n";
    }
    std::cout.flush();
}

If you don't want to change the order of your original array, you would have to make a copy of the whole array first though.

Upvotes: 11

kingchris
kingchris

Reputation: 1757

Just so you can all see what I am currently doing which seems to work.

struct closest{
    int base_ID;
    int ID;
    double distance;

    closest(int BaseID, int Point_ID, 
    double Point_distance):base_ID(BaseID), 
    ID(Point_ID),distance(Point_distance){}
    closest():base_ID(-1), ID(-1),
    distance(std::numeric_limits<double>::max( )){}

    bool operator<(const closest& rhs) const
    {
        return distance < rhs.distance;
    }
};



void calc_nearest(void) 
{ 
    boost::heap::priority_queue<closest> svec;

    for (int current_gift = 0; current_gift < g_nVerticesPopulated; ++current_gift)
    {   double best_distance=std::numeric_limits<double>::max();    
        double our_distance=0.0;
        svec.clear();

        for (int all_other_gifts = 0; all_other_gifts < g_nVerticesPopulated;++all_other_gifts)
        {   
            our_distance = distanceVincenty(g_pVertices[current_gift].lat,g_pVertices[current_gift].lon,g_pVertices[all_other_gifts].lat,g_pVertices[all_other_gifts].lon);
            if (our_distance != 0.0)
            {
                if (our_distance < best_distance) // don't bother to push and sort if the calculated distance is greater than current 20th value
                    svec.push(closest(g_pVertices[current_gift].ID,g_pVertices[current_gift].ID,our_distance));

                if (all_other_gifts%100 == 0)
                {
                    while (svec.size() > no_of_closest_points_to_calculate) svec.pop(); // throw away any points above no_of_closest_points_to_calculate
                    closest t =  svec.top(); // the furthest of the no_of_closest_points_to_calculate points for optimisation
                    best_distance = t.distance;
                }
            }
        }
        std::cout << current_gift << "\n";
    }
}

As you can see. I have 100 000 lat & long points draw on an openGl sphere. I am calculating each point against every other point and only retaining currently the closest 20 points. There is some primitive optimisation going on by not pushing a value if it is bigger than the 20th closest point.

As I am used to Prolog taking hours to solve something I am not worried about speed. I shall run this overnight.

Thanks to all for your help. It is much appreciated. Still have to audit the code and results but happy that I am moving in the right direction.

Upvotes: 4

Marandil
Marandil

Reputation: 1052

I am not 100% sure, that there is no more elegant solution, but even std::set is pretty pretty.

All you have to do is to define a proper comparator for your elements (e.g. a > operator) and then do the following:

std::set<closest> tops(arr, arr+20)
tops.insert(another);
tops.erase(tops.begin());

Upvotes: 6

Mario
Mario

Reputation: 36517

My first idea would be using a std::map or std::set with a custom comparator for this (edit: or even better, a std::priority_queue as mentioned in the comments).

Your comparator does your sorting.

You essentially add all your elements to it. After an element has been added, check whether there are more than n elements inside. If there are, remove the last one.

Upvotes: 7

Related Questions