Reputation: 121
I'm doing a k-means clustering, with a fixed amount of groups, of several geographical lat/lon points. This basic clustering works just fine.
But i have another variable (one for each point) which i would like the k-means clustering to account for. Is this possible somehow?
The clustering data could look like this:
Lat: [1.23, 2.12, 3.65, 4.32, 5.63, 5.43]
Lon: [1.43, 2.43, 3.76, 4.43, 5.25, 1.75]
Extra variable: [20, 20, 10, 10, 10, 10]
If i want the above data divided into 2 groups, and the sum of the extra variable for each group cannot exceed the sum of 40, how would i go about this? (If it's possible at all - My understanding of k-means is pretty basic/low-end.)
Upvotes: 0
Views: 669
Reputation: 20302
Ok, so just add in the extra feature and run it.
data = np.asarray([np.asarray(df['Lat']),np.asarray(df['Lon']),np.asarray(df['Extra variable'])])
See the link below for more info.
https://www.pythonforfinance.net/2018/02/08/stock-clusters-using-k-means-algorithm-in-python/
Upvotes: 0
Reputation: 3938
It seems that this is no longer a basic clustering application but rather an optimization problem with constraints. In words, you are hoping to accomplish:
Minimize the total distance (in lat lon) between points grouped into cluster 1 and the points grouped into cluster 2
subject to the constraint that the sum of Extra variable in cluster 1 and in cluster 2 is less than 40 for each cluster.
This is a nonlinear program, so you'd have to use a non-linear optimization tool to solve this.
Alternatively, depending on the size of the data, you could modify k-means clustering such that it continues to shift cluster centroids and reassign data points, but detect when a data reassignment will push a cluster over the limit for the sum of Extra variable. In this case, you could move the centroid of the cluster randomly instead. Keep track of the best set of clusters (some combination of low intracluster distance and high intercluster difference), and after some amount of time, use the best set of clusters obtained by this method.
Upvotes: 2