Kmeans clustering on different distance function in Lab space

Problem: To cluster the similar colour pixels in CIE LAB using K means.

I want to use CIE 94 for distance between 2 pixels Formula of CIE94

What i read was Kmeans work in "Euclidean space" where the positional cordinates are minimised by cost function which is (sum of squared difference) The reason of not Using kmeans in space other than euclidean is """algorithm is often presented as assigning objects to the nearest cluster by distance. The standard algorithm aims at minimizing the within-cluster sum of squares (WCSS) objective, and thus assigns by "least sum of squares", which is exactly equivalent to assigning by the smallest Euclidean distance. Using a different distance function other than (squared) Euclidean distance may stop the algorithm from converging""(source wiki)

So how to use distance CIE 94 in LAB SPACE for similar colour clustering ?

So how to approach the problem ? What should be the minimisation function here ? HOW to map euclidean space to lab space if for the k mean euclidean formula to work ? Any other approach here ?

Upvotes: 1

Answers (2)

andrew

Reputation: 2469

The reason that CIE LAB is often used for clustering is because it reduces the color to 2 dimensions (as opposed to RGB with 3 color channels). You can easily think of the color for each pixel in a Cartesian coordinate system, instead of points (x,y) you have points (a,b) From here you simply perform a 2d kmeans.

Exactly how you implement kmeans is up to you. The nice thing about reducing colors to a 2d space is we can imagine the data on a grid, and now we can use any regular distance measure we want. Mahalonobis, euclidean, 1 norm, city block, etc. The possibilities are really endless here.

You don't have to use CIELAB, you can just as easily use YCbCr, YUV, or any other colorspace that represents color in 2 dimensions. IF you wanted to try a 3d kmeans you could use rgb, hsv, etc. One problem with higher dimensionality is sparsity of clusters (large variance) and most importantly, increased computation time.

Just for fun I've included two images clustered using kmeans, one in LAB and one in YCbCr, you can see the clustering is nearly identical (except that the labels are different), just proving that the exact color space is irrelevant, the main point is to match the dimensionality of your kmeans with that of your data

EDIT

You made some good points in your comments. I was merely demonstrating that by abstracting the problem you can imagine many variations for the same basic clustering algorithm. But you are right, there are advantages to using CIELAB

Back to the distance measure. Kmeans has two steps, assignment, and update (it is very similar to the Expectation Maximization algorithm). This distance is used in assignment step of k-means. Here is some psuedo code

for each pixel 1 to rows*cols
    for each cluster 1 to k
        dist[k] = calculate_distance(pixel, mu[k])

    pixel_id = index k of minimum dist

you would create a function calculate_distance that uses the delta_e calculation from cielab94. This formula uses all 3 channels to calculate distance. Hopefully this answers your questions

NOTE My examples only use the 2 color channels, ignoring the luminance channel. I used this technique since often the goal is group colors despite lighting disparities(such as shadows). The delta_E measure is not lighting invariant. This may or may not be a concern for your application, but it is something to keep in mind.

results using square euclidean distance lab cluster sq euclidean ycbcr cluster sq euclidean