Tomi Gelo

Reputation: 97

Mapping KMeans cluster centers to the original dataframe

As I understand it, the cluster_centers_ attribute of scikit-learn's KMeans holds the points computed by the algorithm (the cluster means) that minimize the sum of squared distances to the data points assigned to each cluster.

Now, in my case cluster_centers_ returns a 4x13 array. So far, so good.

In [102]: k_means.cluster_centers_

Out[102]: array([[ 4.78931977e-01,  4.90762118e-01,  4.45716436e-01,
     4.06958828e-01,  1.75669885e-01,  7.20500999e-01,
     1.00000000e+00,  4.67334062e-01,  7.62096965e-01,
     3.26627062e-01,  1.11299030e-01,  1.00000000e+00,
     3.38983051e-03],
   [ 2.56178744e-01,  6.31538163e-01,  6.35222200e-01,
     5.50653164e-01,  1.95449906e-01,  8.42033556e-01,
    -8.28226376e-14,  4.86866204e-01,  7.88197801e-01,
     4.63464418e-01,  1.07503725e-01,  9.65338920e-14,
     8.80867977e-03],
   [ 3.00150863e-01,  6.07788520e-01,  6.05935644e-01,
     4.35146301e-01,  1.95530922e-01,  8.38422087e-01,
     1.00000000e+00,  4.89682837e-01,  7.78838601e-01,
     4.75986892e-01,  1.07519045e-01, -3.79418719e-14,
     9.14063961e-03],
   [ 4.27285065e-01,  5.13167435e-01,  5.00494859e-01,
     5.48965002e-01,  1.86222531e-01,  7.40201080e-01,
    -8.29336599e-14,  4.71366946e-01,  7.67300469e-01,
     3.33472857e-01,  1.12865093e-01,  1.00000000e+00,
     1.87793427e-03]])

As a next step I would like to assign the correct column names to the cluster center values since the array alone isn't telling me much.

However, when I create a new dataframe with the column names of the original dataframe using the code below, the columns clearly do not match the values from the cluster_centers_ array (I compared them against the distribution of the original dataframe).

df_centers = pd.DataFrame(k_means.cluster_centers_, columns=df.columns)

It looks like the array returned by cluster_centers_ doesn't have the same order of features as the original dataframe.

Any idea how to map the array from cluster_centers_ so that it matches the order/structure of the original dataframe used for the clustering?

P.S.: I did some standardization in the process but also inverted it back so that shouldn't be the issue.

Posting the fit/predict part as it was asked for in the comments

k_means.fit(df)
y_pred = k_means.predict(df)

EDIT: I messed up

After some digging in my notebook I found the issue:

So my machine learning process was like this

So when I called the inverse_transform method of my MinMaxScaler, it was still using the old feature order (the one from before my binarization step changed it).
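A minimal sketch of this kind of mismatch, not the actual notebook code: the frame, its column names, and the reordering step are all hypothetical stand-ins. It assumes only that a scaler applies its fitted parameters by column position, so data whose columns were reordered after fitting gets inverted with the wrong min/max.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data; column names and values are illustrative only.
df = pd.DataFrame({"small": [0.0, 1.0, 4.0], "large": [100.0, 200.0, 300.0]})

scaler = MinMaxScaler().fit(df)
scaled = pd.DataFrame(scaler.transform(df), columns=df.columns)

# Stand-in for a later step (e.g. binarization) that changes the column order.
reordered = scaled[["large", "small"]]

# The scaler works positionally: column 0 is inverted with the parameters
# fitted for "small", even though it now holds the "large" values.
wrong = scaler.inverse_transform(reordered.to_numpy())

# Restoring the fitted column order before inverting recovers the original data.
right = scaler.inverse_transform(reordered[df.columns].to_numpy())

print(np.allclose(right, df.to_numpy()))
print(np.allclose(wrong, df[["large", "small"]].to_numpy()))
```

Selecting the columns in the fitted order (here `reordered[df.columns]`) before calling inverse_transform is one way to keep the scaler and the data aligned.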

Upvotes: 2

Views: 4283

Answers (1)

ignoring_gravity

Reputation: 10501

Are you sure it's changing the order of the features?

It's impossible to check your code, as you haven't provided a minimal working example, but I just tried this:

from sklearn.cluster import KMeans
import numpy as np
X = np.array([[0, 1], [2, 3]])
for i in range(100):
    kmeans = KMeans(n_clusters=2, random_state=i).fit(X)
    print(kmeans.cluster_centers_)

and got that the order of the features was preserved every single time.
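The same check can be sketched on a dataframe. This is an illustration under assumptions, not a reproduction of the question's data: the column names and values are made up. The centers array is passed straight to pd.DataFrame, since passing an existing DataFrame together with columns= selects columns by label rather than renaming them. Because each center is the mean of the rows assigned to it, comparing against a per-cluster mean confirms the columns line up.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical data; column names and values are illustrative only.
df = pd.DataFrame({
    "a": [0.0, 0.1, 0.9, 1.0],
    "b": [10.0, 10.2, 20.0, 20.4],
    "c": [5.0, 5.0, 7.0, 7.0],
})

k_means = KMeans(n_clusters=2, n_init=10, random_state=0).fit(df)

# Label the centers with the columns of the frame the model was fitted on.
df_centers = pd.DataFrame(k_means.cluster_centers_, columns=df.columns)

# Per-cluster column means of the original rows, indexed by cluster label.
expected = df.groupby(k_means.labels_).mean()

print(df_centers)
print(np.allclose(df_centers.to_numpy(), expected.to_numpy()))
```

If the comparison fails on real data, the mismatch most likely comes from a preprocessing step before or after the fit rather than from KMeans itself.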

Upvotes: 2
