bhavithra

Reputation: 155

When to use k means clustering algorithm?

Can I use k-means algorithm for a single attribute?

Is there any relationship between the attributes and the number of clusters?

I have one attribute, performance, and I want to classify the data into 3 clusters: poor, medium, and good.

Is it possible to create 3 clusters with one attribute?

Upvotes: 2

Views: 11409

Answers (5)

Yun Zhao

Reputation: 155

With only one attribute, you don't need k-means. First, I'd like to know whether your attribute is numerical or categorical.

If it's numerical, it's easier to set up two thresholds. If it's categorical, things get even easier: just specify which classes belong to poor, medium, or good, and simple data frame operations will do the job.
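For the numerical case, the two-threshold idea is just a pair of comparisons. A minimal sketch (the cutoffs 50 and 75 are made-up placeholders; pick values that fit your performance scale):

```python
# Bucket a single numerical attribute with two thresholds.
# The cutoffs (50 and 75) are illustrative, not prescriptive.

def label_performance(score, low=50, high=75):
    """Map a numeric score to poor / medium / good."""
    if score < low:
        return "poor"
    elif score < high:
        return "medium"
    return "good"

scores = [32, 58, 91, 74, 80]
labels = [label_performance(s) for s in scores]
print(labels)  # ['poor', 'medium', 'good', 'medium', 'good']
```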

Feel free to send me comments if you are still confused.


Upvotes: 0

Sau001

Reputation: 1664

As others have answered already, k-means requires the number of clusters as prior input. That may not seem very helpful at first, but here is a scenario I worked on where it proved very useful.

Color segmentation

Think of a picture with 3 channels of information (red, green, blue). You want to quantize the colors into 20 different bands for the purpose of dimensionality reduction. This is called vector quantization.

Every pixel is a 3 dimensional vector with Red, Green and Blue components. If the image is 100 pixels by 100 pixels then you have 10,000 vectors.

R,G,B
128,100,20
120,9,30
255,255,255
128,100,20
120,9,30
.
.
.

Depending on the type of analysis you intend to perform, you may not need all the R,G,B values. It might be simpler to deal with an ordinal representation. In the above example, each distinct RGB triple might be assigned a flat integer label:

R,G,B
128,100,20 => 1
120,9,30   => 2
255,255,255=> 3
128,100,20 => 1
120,9,30   => 2

You run the k-means algorithm on these 10,000 vectors and specify 20 clusters. Result: you have reduced your image colors to 20 broad buckets. Obviously some information is lost. The intuition for why this loss is acceptable: when the human eye gazes over a patch of green meadow, we are unlikely to register all 16 million RGB colours.
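The quantization step above can be sketched with a minimal hand-rolled Lloyd's loop over numpy arrays (the "image" here is synthetic random pixels, just to keep the example self-contained; `sklearn.cluster.KMeans` would give you a production-quality version of the same thing):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "image": 10,000 pixels, each an (R, G, B) vector.
pixels = rng.integers(0, 256, size=(10_000, 3)).astype(float)

def kmeans(points, k, iters=10, seed=0):
    """Minimal Lloyd's algorithm: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        d = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Move each centroid to the mean of its assigned points.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = points[labels == j].mean(axis=0)
    return centroids, labels

centroids, labels = kmeans(pixels, k=20)
quantized = centroids[labels]  # every pixel snapped to its color band
print(len(np.unique(quantized, axis=0)))  # at most 20 distinct colors remain
```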

YouTube video

https://www.youtube.com/watch?v=yR7k19YBqiw I have embedded key pictures from this video for your understanding. Note: I am not the author of this video.

[Original image]

[After segmentation using k-means]

Upvotes: 2

Has QUIT--Anony-Mousse

Reputation: 77454

If you have one dimensional data, search stackoverflow for better approaches than k-means.

K-means and other clustering algorithms shine when you have multivariate data. They will "work" with 1-dimensional data, but they are not very smart anymore.

One-dimensional data is ordered. If you sort your data (or it is already sorted), it can be processed much more efficiently than with k-means. The complexity of k-means is "just" O(n*k*i), but if your data is sorted and 1-dimensional you can actually improve k-means to O(k*i). Sorting comes at a cost, but there are very good sort implementations everywhere...

Plus, for 1-dimensional data there are a lot of statistics you can use that are not well researched or tractable in higher dimensions. One statistic you really should try is kernel density estimation. Maybe also try Jenks Natural Breaks Optimization.
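A minimal sketch of the kernel-density idea, with a hand-rolled Gaussian KDE (the data, grid, and bandwidth are made up for illustration): evaluate the density on a grid and split the data at the local minima, i.e. the valleys between modes.

```python
import math

def kde(xs, grid, bandwidth):
    """Gaussian kernel density estimate of xs, evaluated on grid."""
    out = []
    for g in grid:
        s = sum(math.exp(-0.5 * ((g - x) / bandwidth) ** 2) for x in xs)
        out.append(s / (len(xs) * bandwidth * math.sqrt(2 * math.pi)))
    return out

data = [1, 2, 3, 48, 50, 52, 97, 99, 100]   # three obvious groups
grid = [i * 0.5 for i in range(0, 205)]     # 0.0 .. 102.0
density = kde(data, grid, bandwidth=3.0)

# Split points = local minima of the estimated density.
breaks = [grid[i] for i in range(1, len(grid) - 1)
          if density[i] < density[i - 1] and density[i] < density[i + 1]]
print(breaks)  # two valleys, one between each pair of groups
```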

However, if you want to just split your data into poor/medium/high, why don't you just use two thresholds?

Upvotes: 2

Brian

Reputation: 7316

K-means is useful when you have an idea of how many clusters actually exist in your space. Its main benefit is its speed. There is a relationship between the attributes and the number of observations in your dataset.

Sometimes a dataset suffers from the Curse of Dimensionality, where the number of variables/attributes is much greater than the number of observations. Basically, in high-dimensional spaces with few observations, it becomes difficult to separate the observations.

You can certainly have three clusters with one attribute. Consider a quantitative attribute with 7 observations:

  • 1
  • 2
  • 100
  • 101
  • 500
  • 499
  • 501

Notice there are three clusters in this sample, centered at 1.5, 100.5, and 500.
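A quick check of that claim with a hand-rolled 1-D Lloyd's iteration (pure Python keeps the sketch dependency-free; scikit-learn's `KMeans` on the reshaped column would give the same centers). The initial centers are seeded at the min, median, and max of the data:

```python
def kmeans_1d(data, centers, iters=10):
    """Minimal 1-D Lloyd's algorithm: returns the final centers."""
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for x in data:
            # assign x to the nearest current center
            nearest = min(range(len(centers)), key=lambda j: abs(x - centers[j]))
            clusters[nearest].append(x)
        # recompute each center as the mean of its cluster
        centers = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
    return centers

data = [1, 2, 100, 101, 500, 499, 501]
centers = kmeans_1d(data, centers=[1, 101, 501])  # min, median, max
print(sorted(centers))  # [1.5, 100.5, 500.0]
```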

Upvotes: 3

nth

Reputation: 141

Yes, it is possible to use clustering with a single attribute.

No, there is no known relationship between the number of clusters and the attributes. However, some studies suggest taking the number of clusters k ≈ √(n/2), where n is the total number of items. This is just one rule of thumb; different studies have suggested different cluster numbers. The best way to determine the number of clusters is to pick the one that minimizes intra-cluster distance and maximizes inter-cluster distance. Having background knowledge is also important.
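The "minimize intra-cluster distance" criterion is usually eyeballed via the elbow method: plot the within-cluster sum of squares (WCSS) for increasing k and stop where it flattens. A dependency-free sketch on made-up data with three obvious groups (the quantile-based seeding is just one reasonable initialization choice):

```python
def wcss(data, centers):
    """Within-cluster sum of squared distances to the nearest center."""
    return sum(min((x - c) ** 2 for c in centers) for x in data)

def kmeans_1d(data, k, iters=20):
    """Minimal 1-D Lloyd's algorithm with quantile-spread seeding."""
    s = sorted(data)
    centers = [s[(2 * j + 1) * len(s) // (2 * k)] for j in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in data:
            clusters[min(range(k), key=lambda j: abs(x - centers[j]))].append(x)
        centers = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
    return centers

data = [1, 2, 3, 50, 51, 52, 99, 100, 101]
for k in (1, 2, 3, 4):
    print(k, round(wcss(data, kmeans_1d(data, k)), 1))
# WCSS drops sharply up to k = 3 and barely changes afterwards --
# that "elbow" suggests three clusters.
```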

The problem you describe, with a performance attribute, is more a classification problem than a clustering problem; see Difference between classification and clustering in data mining?

Upvotes: 1
