Reputation: 1232
I am currently doing a K-means cluster analysis on some customer data at my company. I want to measure the performance of the clustering, but I don't know which library packages to use for that, and I am also unsure whether my clusters are grouped too closely together.
The data feeding the clustering is a simple RFM table (recency, frequency, and monetary value); I also included each customer's average order value per transaction. I used the elbow method to determine the optimal number of clusters. The data consists of 1,400 customers and 4 metric values.
Attached are an image of the cluster plot and my R code:
library(dplyr)       # for glimpse()
library(factoextra)  # for fviz_cluster() below

# Cleaning & scaling the data
drop = c('CUST_Business_NM')
new_cluster_data = na.omit(data)
new_cluster_data = new_cluster_data[, !(names(new_cluster_data) %in% drop)]
new_cluster_data = scale(new_cluster_data)
glimpse(new_cluster_data)
# Elbow method for the optimal number of clusters
k.max <- 15
data <- new_cluster_data
wss <- sapply(1:k.max,
              function(k){ kmeans(data, centers = k, nstart = 50, iter.max = 15)$tot.withinss })

# Plot out the elbow
wss
plot(1:k.max, wss,
     type = "b", pch = 19, frame = FALSE,
     xlab = "Number of clusters K",
     ylab = "Total within-clusters sum of squares")
# Create the clusters (k = 8 chosen from the elbow plot)
kmeans_test = kmeans(new_cluster_data, centers = 8, nstart = 1000)
View(kmeans_test$cluster)

# Visualize the clusters
fviz_cluster(kmeans_test, data = new_cluster_data,
             show.clust.cent = TRUE, geom = c("point", "text"))
Upvotes: 0
Views: 1765
Reputation: 320
You probably do not want to measure the performance of the clusters themselves, but the performance of the clustering algorithm, in this case kmeans.

First, you need to be clear about which distance measure you want to use. The clustering is computed from a dissimilarity (distance) matrix, so the choice of distance measure is critical; you can play with euclidean, manhattan, any kind of correlation, or other distance measures, e.g., like this:
library("factoextra")
dis_pearson <- get_dist(yourdataset, method = "pearson")
dis_pearson
fviz_dist(dis_pearson)
This will give you the distance matrix and visualize it.
The output of kmeans contains several pieces of information. The most important with regard to your question are:

- totss: the total sum of squares
- withinss: the vector of within-cluster sums of squares, one per cluster
- tot.withinss: the total within-cluster sum of squares
- betweenss: the between-cluster sum of squares

Thus, the goal is to optimize these by playing with distances and other ways of clustering the data. You can extract these measures with base R's kmeans, simply by running mycluster <- kmeans(yourdataframe, centers = 2) and then calling mycluster.
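For example, a minimal sketch of reading those fields off the fit from your question (assuming the scaled new_cluster_data and the k = 8 used there):

# Fit k-means as in the question
kmeans_test <- kmeans(new_cluster_data, centers = 8, nstart = 1000)

# Sum-of-squares diagnostics returned by kmeans()
kmeans_test$totss         # total sum of squares
kmeans_test$withinss      # within-cluster sum of squares, one value per cluster
kmeans_test$tot.withinss  # total within-cluster sum of squares (what the elbow plot uses)
kmeans_test$betweenss     # between-cluster sum of squares

# A common single-number summary: share of total variance explained by the clustering
kmeans_test$betweenss / kmeans_test$totss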
Side comment: kmeans requires the number of clusters to be defined by the user (additional effort), and it is very sensitive to outliers.
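If you are also unsure whether your clusters sit too close together, one possible check (not the only one) is the silhouette width; here is a minimal sketch with the cluster and factoextra packages, assuming the scaled new_cluster_data and the kmeans_test fit from your question:

library(cluster)     # silhouette()
library(factoextra)  # fviz_silhouette(), fviz_nbclust()

# Silhouette widths: values near 1 = well separated, values near 0 = overlapping clusters
sil <- silhouette(kmeans_test$cluster, dist(new_cluster_data))
fviz_silhouette(sil)

# Alternative to the elbow: pick k by the average silhouette width
fviz_nbclust(new_cluster_data, kmeans, method = "silhouette")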
Upvotes: 3