Reputation: 9
I am new to machine learning and to the k-means algorithm. After searching quite a bit, I've determined that I can use the Elbow, Silhouette, or Gap statistic method to find the right k for k-means. The issue is that each method gives me a vastly different result. The data is one user's locations as latitude and longitude, and scaling has little to no effect because all of the locations are practically within the same 50-mile radius.
This is the code that I used in R:
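# Determining the right number of clusters for each user, beginning with UserId = 2949
library(factoextra)  # provides fviz_nbclust()
library(ggplot2)     # provides labs()
la <- user2949$Latitude
lo <- user2949$Longitude
p <- cbind(la, lo)
s <- scale(p)  # scaled copy; the calls below use the unscaled p
head(s)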
# Using Elbow Method
Elbow <- fviz_nbclust(p, kmeans, method = "wss") + labs(subtitle = "Elbow Method")
Elbow
# Using Silhouette Method
Silhouette <- fviz_nbclust(p, kmeans, method = "silhouette") + labs(subtitle = "Silhouette Method")
Silhouette
# Using Gap Statistic
set.seed(123)
# k.max is an argument of fviz_nbclust(), not labs()
Gap <- fviz_nbclust(p, kmeans, nstart = 25, method = "gap_stat", nboot = 50, k.max = 20) + labs(subtitle = "Gap Statistic Method")
Gap
The output (these are only links because I apparently cannot post photos without a reputation of 10):
- Elbow method: Another issue for me is determining the bend. I heard I should look into BIC, but I don't know how to approach that. From looking at the plot, I concluded that the optimal number of clusters is likely 6.
- Silhouette method: This method says 10, which is probably not feasible for what I am trying to do given the sheer number of users.
- Gap statistic: It says 1 cluster is enough.
I don't know what is misleading and what is not because I do not have expert knowledge of each method.
The ultimate goal of this project is to look at all user locations and determine where each user's "home" is based on their activity (which is picked up by beacons in fast food restaurants). I am trying to find a large-scale way to determine user location for almost 70,000 users. My initial thought was to write a loop using the most effective of these methods and use the centers of the clusters as possible home locations. What code can I use that will give me the correct number of clusters without having to look at 70,000 graphs?
Upvotes: 0
Views: 2750
Reputation: 77454
If these heuristics contradict each other, it usually means that k-means failed on this data and that no k is good. k-means is not a very robust algorithm; in particular, it is sensitive to outliers.
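To check for that kind of contradiction without looking at plots, here is a minimal sketch that computes two of the heuristics numerically and reports whether they agree. It calls the `cluster` package directly instead of `fviz_nbclust`. The helper name `choose_k`, its parameters `k_max` and `n_boot`, and the synthetic data are all assumptions made for illustration, not part of the original code.

```r
library(cluster)  # silhouette(), clusGap(), maxSE()

# Hypothetical helper: pick k with two heuristics and report whether they agree.
choose_k <- function(xy, k_max = 6, n_boot = 50) {
  d <- dist(xy)
  # Average silhouette width for k = 2..k_max (it is undefined for k = 1).
  sil <- sapply(2:k_max, function(k) {
    km <- kmeans(xy, centers = k, nstart = 25)
    mean(silhouette(km$cluster, d)[, "sil_width"])
  })
  k_sil <- (2:k_max)[which.max(sil)]
  # Gap statistic for k = 1..k_max, using the "firstSEmax" selection rule.
  gap <- clusGap(xy, FUNcluster = kmeans, nstart = 25,
                 K.max = k_max, B = n_boot, verbose = FALSE)
  k_gap <- maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"], method = "firstSEmax")
  # If the gap statistic picks 1, the two can never agree.
  list(k_sil = k_sil, k_gap = k_gap, agree = (k_sil == k_gap))
}

# Synthetic data (assumption): three tight, well-separated groups of points.
set.seed(1)
centers <- rbind(c(0, 0), c(1, 1), c(2, 0))
xy <- do.call(rbind, lapply(1:3, function(i) {
  cbind(la = rnorm(30, centers[i, 1], 0.05),
        lo = rnorm(30, centers[i, 2], 0.05))
}))
res <- choose_k(xy)
res
```

In a loop over users, anyone with `agree == FALSE` could be set aside for a closer look rather than being assigned a k automatically.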
You need to improve your preprocessing and reconsider your assumptions about what similarity means and what a cluster is.
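One possible reading of "reconsider what the similarity is": Euclidean distance on raw latitude and longitude treats a degree of longitude like a degree of latitude, although a degree of longitude covers less ground away from the equator. The sketch below swaps in great-circle (haversine) distance in kilometres and clusters with k-medoids (`pam` from the `cluster` package), which uses actual data points as centers and is generally less sensitive to outliers than k-means. The function name `haversine_matrix`, the Earth radius of 6371 km, the sample coordinates, and `k = 3` are all assumptions for illustration.

```r
library(cluster)  # pam()

# Hypothetical helper: great-circle (haversine) distance matrix in km
# for points given as latitude/longitude in degrees.
haversine_matrix <- function(lat, lon, radius_km = 6371) {
  phi <- lat * pi / 180
  lam <- lon * pi / 180
  dphi <- outer(phi, phi, "-")
  dlam <- outer(lam, lam, "-")
  a <- sin(dphi / 2)^2 + outer(cos(phi), cos(phi)) * sin(dlam / 2)^2
  # Clamp at 1 to guard against rounding error before taking asin().
  2 * radius_km * asin(pmin(sqrt(a), 1))
}

# Synthetic example (assumption): two tight groups of visits and one distant outlier.
lat <- c(40.00, 40.01, 40.02, 40.50, 40.51, 40.52, 45.00)
lon <- c(-75.00, -75.01, -75.02, -75.50, -75.51, -75.52, -70.00)
D <- haversine_matrix(lat, lon)

# k-medoids on the precomputed distances; k = 3 is chosen by hand here.
fit <- pam(as.dist(D), k = 3)
fit$clustering  # cluster label for each point
fit$id.med      # row indices of the medoids
```

Because the medoids are real observations, a medoid can be read directly as a visited location, which is not true of a k-means center.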
Upvotes: 0