Reputation: 874
I'm studying machine learning and want to implement k-means clustering to understand it better. I have a dataset of cats, each with 4 measurements. I want to cluster them into 2 and 3 distinct hard clusters based on these properties to see whether their breeds can be determined from these measurements.
I am familiar with the general algorithm, but I am struggling to get my head around how to measure the distance between an (x,y) tuple and a cat (which has 4 properties). Generally, with Euclidean distance you compare the x, y, etc. values against each other, but if I have 4 properties, how can I measure how far a cat is from an (x,y) pair on a 2D plane? It doesn't make much sense to me even after reading up on this concept.
I believe that on a 2D plane I can indeed only look at two of the 4 properties, or is this not correct? Without compressing the data to 2 dimensions, I don't see how one could do that.
PS: I know there are libraries that implement k-means clustering; that is not the point.
Upvotes: 3
Views: 511
Reputation: 930
What you're referring to in your question is the Euclidean distance between two points in a 2D plane. You want to perform clustering in a space where each property vector is itself a data point, which is not possible in a 2D plane. Hence, you want to work in an n-dimensional space, where each data point is an n-dimensional vector and each dimension represents a feature. In your case, n is 4, since you have 4 features (properties) per data point.
You can randomize the centroids by choosing, for each feature, any value between the minimum and the maximum of that feature across all the feature vectors.
Let's say you have 3 different cats with the following properties: [1, 5, 9, 10], [2, 3, 4, 3], [5, 6, 1, 5]. These are nothing but feature vectors. You will run the clustering as below:
You begin by computing the min and max vectors: min = [1, 3, 1, 3] and max = [5, 6, 9, 10]. So you assign centroids in the following range: [1...5, 3...6, 1...9, 3...10].
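As a rough illustration of the min/max initialization described above, here is a minimal sketch in plain Python. The helper names `feature_bounds` and `random_centroids` are made up for this example, and drawing each coordinate uniformly at random is just one reasonable reading of "assign centroids in the following range".

```python
import random

# The three example cats from the answer, one 4-dimensional feature vector each.
cats = [[1, 5, 9, 10], [2, 3, 4, 3], [5, 6, 1, 5]]

def feature_bounds(points):
    """Return the per-feature minimum and maximum vectors (hypothetical helper)."""
    mins = [min(column) for column in zip(*points)]
    maxs = [max(column) for column in zip(*points)]
    return mins, maxs

def random_centroids(points, k, rng=None):
    """Draw k centroids, each coordinate uniform within that feature's range."""
    rng = rng or random.Random()
    mins, maxs = feature_bounds(points)
    return [[rng.uniform(lo, hi) for lo, hi in zip(mins, maxs)]
            for _ in range(k)]

mins, maxs = feature_bounds(cats)
centroids = random_centroids(cats, k=2, rng=random.Random(0))
print(mins, maxs)
```

Each centroid is itself a 4-element vector, so it lives in the same space as the cats and can be compared with them directly.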
Once the centroids are initialized (either randomly or based on heuristic estimates), you run the algorithm and recompute centroids on each iteration.
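The iteration itself could look roughly like the sketch below. It is an illustrative plain-Python version, not a tuned implementation: the function name `kmeans`, the `max_iter` parameter, and the choice to leave an empty cluster's centroid unchanged are all assumptions made for this example. It uses `math.dist` (Python 3.8+) for the distance, and starts from two of the cats as initial centroids purely to keep the run deterministic.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Alternate assignment and update steps until the centroids stop moving."""
    centroids = [list(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its assigned points.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                new_centroids.append(
                    [sum(column) / len(members) for column in zip(*members)]
                )
            else:
                # Assumption: keep the old centroid if no point was assigned to it.
                new_centroids.append(old)
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels

cats = [[1, 5, 9, 10], [2, 3, 4, 3], [5, 6, 1, 5]]
centroids, labels = kmeans(cats, centroids=[[1, 5, 9, 10], [5, 6, 1, 5]])
print(labels)
print(centroids)
```

Nothing in the loop depends on the number of features; the same code works for 2, 4, or any other number of dimensions.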
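You calculate the distance between a data point and a centroid as the Euclidean distance between two n-dimensional vectors p and q:

$$d(p, q) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}$$

where q_i is the i-th element of vector q (and likewise p_i for vector p).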
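To make the n-dimensional distance concrete, here is a small sketch in Python. The function name `euclidean_distance` is chosen for this example, and the centroid value is simply one of the cats reused as a stand-in.

```python
import math

def euclidean_distance(p, q):
    """Euclidean distance between two vectors of the same length."""
    if len(p) != len(q):
        raise ValueError("vectors must have the same number of features")
    return math.sqrt(sum((qi - pi) ** 2 for pi, qi in zip(p, q)))

cat = [2, 3, 4, 3]
centroid = [5, 6, 1, 5]  # stand-in centroid for illustration
# Squared differences are 9, 9, 9 and 4, so the distance is sqrt(31).
d = euclidean_distance(cat, centroid)
print(d)
```

With only two features this reduces to the familiar 2D formula, which is why no projection onto a plane is needed.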
Hope it helped!
Upvotes: 3