John Smith

Reputation: 43

Closest point algorithm | How to improve it?

I wrote a k-means clustering algorithm and a color quantization algorithm. They work as expected in terms of results, but I want to make them faster. Both implementations reduce to the same problem: given two arrays of points in 3D space, for each point of the first array find the closest point in the second array. I do it like this:

#include <cmath>
#include <cstddef>
#include <limits>

size_t closest_cluster_index = 0;
double x_dif, y_dif, z_dif;
double old_distance;
double new_distance;

for (auto point = points.begin(); point != points.end(); point++)
{
    //FIX, as suggested by juvian (k = 1):
    //if the previous closest cluster is provably still the closest
    //(triangle inequality), skip the full scan below
    if (point != points.begin())
    {
        auto cluster = &(clusters[closest_cluster_index]);

        x_dif = cluster->x - point->x;
        y_dif = cluster->y - point->y;
        z_dif = cluster->z - point->z;

        new_distance = x_dif * x_dif + y_dif * y_dif + z_dif * z_dif;

        //compare distances to distances, not a squared distance to a distance
        if (std::sqrt(new_distance) <= std::sqrt(old_distance) - ColorU8::differenceRGB(*(point - 1), *point))
        {
            old_distance = new_distance;
            //do sth with closest_cluster_index
            continue;
        }
    }
    //END OF FIX

    old_distance = std::numeric_limits<double>::infinity();

    for (auto cluster = clusters.begin(); cluster != clusters.end(); cluster++)
    {
        x_dif = cluster->x - point->x;
        y_dif = cluster->y - point->y;
        z_dif = cluster->z - point->z;

        new_distance = x_dif * x_dif + y_dif * y_dif + z_dif * z_dif;

        if (new_distance < old_distance)
        {
            old_distance = new_distance;
            closest_cluster_index = cluster - clusters.begin();
        }
    }
    //do sth with: closest_cluster_index
}

How can I improve it? (I don't want to make it multithreaded or move the computation to the GPU.)

Upvotes: 4

Views: 533

Answers (1)

juvian

Reputation: 16068

There are several data structures for efficient nearest-neighbour queries. In 3D, a k-d tree works really well: each query takes O(log n) on average, which improves on your current O(n) linear scan.

So with this structure you can insert all the points from clusters into it, and then for each point in points query the tree for the closest one. For your particular case a static k-d tree is enough, since you don't need to update the points after building.
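A minimal static 3D k-d tree can be sketched like this (illustrative code, not a drop-in for your program; `Point`, `build`, and `nearest` are made-up names). It sorts the cluster array into tree order once with `std::nth_element`, then answers each nearest-neighbour query recursively:

```cpp
#include <algorithm>
#include <cstddef>
#include <limits>
#include <vector>

struct Point { double x, y, z; };

double sqDist(const Point& a, const Point& b) {
    double dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    return dx * dx + dy * dy + dz * dz;
}

double coord(const Point& p, int axis) {
    return axis == 0 ? p.x : axis == 1 ? p.y : p.z;
}

// Recursively sort pts[lo, hi) into k-d tree order in place: the median on
// the current axis sits at the midpoint, smaller coordinates on the left,
// larger on the right; axes cycle x -> y -> z per level.
void build(std::vector<Point>& pts, size_t lo, size_t hi, int axis) {
    if (hi - lo <= 1) return;
    size_t mid = (lo + hi) / 2;
    std::nth_element(pts.begin() + lo, pts.begin() + mid, pts.begin() + hi,
        [axis](const Point& a, const Point& b) {
            return coord(a, axis) < coord(b, axis);
        });
    build(pts, lo, mid, (axis + 1) % 3);
    build(pts, mid + 1, hi, (axis + 1) % 3);
}

// Nearest-neighbour search: descend toward the query first, then only visit
// the far subtree when the splitting plane is closer than the best hit so far.
void nearest(const std::vector<Point>& pts, size_t lo, size_t hi, int axis,
             const Point& q, size_t& bestIdx, double& bestSq) {
    if (lo >= hi) return;
    size_t mid = (lo + hi) / 2;
    double d = sqDist(pts[mid], q);
    if (d < bestSq) { bestSq = d; bestIdx = mid; }
    double delta = coord(q, axis) - coord(pts[mid], axis);
    int next = (axis + 1) % 3;
    if (delta < 0) {
        nearest(pts, lo, mid, next, q, bestIdx, bestSq);
        if (delta * delta < bestSq) nearest(pts, mid + 1, hi, next, q, bestIdx, bestSq);
    } else {
        nearest(pts, mid + 1, hi, next, q, bestIdx, bestSq);
        if (delta * delta < bestSq) nearest(pts, lo, mid, next, q, bestIdx, bestSq);
    }
}
```

Usage would be one `build(clusters, 0, clusters.size(), 0);` up front, then one `nearest(...)` call per point; since `std::nth_element` partitions in place, there are no node structures or extra allocations.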

Another approach:

We can risk doing extra computations on some points in exchange for doing far fewer on others. This method should work well under the following assumptions:

  • Clusters are far apart from each other
  • Adjacent points are close to each other

I think these apply to your case: your clusters are a handful of colors, and your points come from a real image, where adjacent pixels tend to have similar colors.

Instead of storing only the closest cluster for the current point, keep its k closest clusters in a max-heap. When we move on to the next point, we can reuse this information. Let's call the reference point P and its k-th closest cluster C.

Now, for a new point P2, before comparing against all clusters we check whether the closest cluster to P2 is guaranteed to be in our heap. By the triangle inequality, this holds if some cluster in the heap is within distance(P, C) - distance(P, P2) of P2: every cluster outside the heap is at least distance(P, C) away from P, hence at least distance(P, C) - distance(P, P2) away from P2. When the check succeeds, we search only the heap instead of all clusters. When it fails, we compare against all clusters, rebuild the heap, and P becomes P2.

You will need to try different values of k to see whether it helps. For k = 2, it might be worth avoiding the added complexity of a heap and just using two variables.
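The idea above can be sketched as follows (illustrative code, not a drop-in for your program; `Point`, `dist`, and `assignClusters` are assumed names, and k is a parameter to tune). A full scan builds a max-heap of the k closest clusters; later points first try only those k candidates, falling back to a full scan when the triangle-inequality bound fails:

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <queue>
#include <utility>
#include <vector>

struct Point { double x, y, z; };

double dist(const Point& a, const Point& b) {
    double dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    return std::sqrt(dx * dx + dy * dy + dz * dz);
}

// For each point, return the index of its closest cluster, skipping the full
// scan whenever the k candidates of an earlier reference point provably still
// contain the nearest cluster (triangle inequality).
std::vector<size_t> assignClusters(const std::vector<Point>& points,
                                   const std::vector<Point>& clusters,
                                   size_t k) {
    std::vector<size_t> result(points.size());
    std::vector<size_t> nearIdx;   // indices of the k clusters closest to ref
    double kthDist = 0;            // distance from ref to its k-th closest cluster
    Point ref{};                   // last point for which a full scan ran
    bool haveRef = false;

    for (size_t i = 0; i < points.size(); ++i) {
        const Point& p = points[i];
        if (haveRef) {
            double shift = dist(ref, p);            // distance(P, P2)
            double bestD = std::numeric_limits<double>::infinity();
            size_t best = 0;
            for (size_t c : nearIdx) {              // search only the k candidates
                double d = dist(clusters[c], p);
                if (d < bestD) { bestD = d; best = c; }
            }
            // Any cluster outside the candidate set is at least kthDist from
            // ref, hence at least kthDist - shift from p; if the best candidate
            // beats that bound, it is the true nearest cluster.
            if (bestD <= kthDist - shift) { result[i] = best; continue; }
        }
        // Full scan: max-heap of (distance, index) keeps the k closest clusters.
        std::priority_queue<std::pair<double, size_t>> heap;
        for (size_t c = 0; c < clusters.size(); ++c) {
            heap.emplace(dist(clusters[c], p), c);
            if (heap.size() > k) heap.pop();
        }
        kthDist = heap.top().first;
        nearIdx.clear();
        while (!heap.empty()) {                     // pops largest first...
            nearIdx.push_back(heap.top().second);
            heap.pop();
        }
        result[i] = nearIdx.back();                 // ...so the last one is closest
        ref = p;
        haveRef = true;
    }
    return result;
}
```

When the image has long runs of similar pixels, most iterations take the cheap k-candidate path; the worst case is still a full scan per point plus the heap overhead, which is why k needs tuning.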

Upvotes: 6
