Mapping KMeans cluster centers to the original dataframe

Question

As I understand it, the cluster_centers_ attribute of scikit-learn's KMeans holds the centroids computed by the algorithm: one point per cluster, placed so that it minimizes the sum of squared distances to the data points assigned to that cluster.
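As a quick check of that understanding: each row of cluster_centers_ should equal the mean of the points assigned to that cluster. A minimal sketch on made-up data (the toy array and the parameter values are assumptions, not the data from this question):

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data: two well-separated groups in 2D (illustrative only).
X = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
              [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]])

k_means = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Recompute each center as the mean of the points assigned to it.
manual_centers = np.array([X[k_means.labels_ == k].mean(axis=0)
                           for k in range(k_means.n_clusters)])

# Compare the recomputed means with what KMeans reports.
centers_match = np.allclose(manual_centers, k_means.cluster_centers_)
print(centers_match)
```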

Now, in my case cluster_centers_ returns a 4x13 array. So far, so good.

In [102]: k_means.cluster_centers_

Out[102]: array([[ 4.78931977e-01,  4.90762118e-01,  4.45716436e-01,
     4.06958828e-01,  1.75669885e-01,  7.20500999e-01,
     1.00000000e+00,  4.67334062e-01,  7.62096965e-01,
     3.26627062e-01,  1.11299030e-01,  1.00000000e+00,
     3.38983051e-03],
   [ 2.56178744e-01,  6.31538163e-01,  6.35222200e-01,
     5.50653164e-01,  1.95449906e-01,  8.42033556e-01,
    -8.28226376e-14,  4.86866204e-01,  7.88197801e-01,
     4.63464418e-01,  1.07503725e-01,  9.65338920e-14,
     8.80867977e-03],
   [ 3.00150863e-01,  6.07788520e-01,  6.05935644e-01,
     4.35146301e-01,  1.95530922e-01,  8.38422087e-01,
     1.00000000e+00,  4.89682837e-01,  7.78838601e-01,
     4.75986892e-01,  1.07519045e-01, -3.79418719e-14,
     9.14063961e-03],
   [ 4.27285065e-01,  5.13167435e-01,  5.00494859e-01,
     5.48965002e-01,  1.86222531e-01,  7.40201080e-01,
    -8.29336599e-14,  4.71366946e-01,  7.67300469e-01,
     3.33472857e-01,  1.12865093e-01,  1.00000000e+00,
     1.87793427e-03]])

As a next step I would like to assign the correct column names to the cluster center values since the array alone isn't telling me much.

However, when I create a new dataframe and assign the column names from the original dataframe using the code below, I can clearly see that the columns do not match the values in the cluster_centers_ array (I compared them against the distribution of the original dataframe).

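import pandas as pd
# Build the frame straight from the array: wrapping an existing DataFrame
# and passing columns= selects columns by label instead of renaming them.
df_centers = pd.DataFrame(k_means.cluster_centers_, columns=df.columns)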

It looks like the array returned by cluster_centers_ doesn't have the same order of features as the original dataframe.

Any idea how to map the array from cluster_centers_ so that it matches the order/structure of the original dataframe used for the clustering?
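For reference, KMeans does not reorder features: column j of cluster_centers_ corresponds to column j of whatever was passed to fit. A hedged sketch with hypothetical column names (a, b, c) and made-up values, labelling the centers with the columns of the frame that was actually fitted:

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical frame; the column names and values are placeholders.
X_fit = pd.DataFrame({
    "a": [0.0, 0.1, 0.2, 0.1],
    "b": [10.0, 10.2, 20.0, 20.1],
    "c": [1.0, 1.0, 0.0, 0.0],
})

k_means = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_fit)

# Pass the raw array plus the columns of the frame used in fit().
df_centers = pd.DataFrame(k_means.cluster_centers_, columns=X_fit.columns)
print(df_centers)
```

If the frame passed to fit differs from the original dataframe (columns dropped, added, or moved), it is the fitted frame's columns that belong here.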

P.S.: I did some standardization in the process but also inverted it back so that shouldn't be the issue.

Posting the fit/predict part as it was asked for in the comments

k_means.fit(df)
y_pred = k_means.predict(df)

EDIT: I messed up

After some digging in my notebook I found the issue:

So my machine learning process was like this:

1. standardization (of the entire dataframe)
2. binarization (only 2 columns of my dataframe, followed by dropping those from the initial df and adding the new, binarized ones instead, which messed up the feature order)
3. clustering (on this new dataframe)

So when I called the inverse_transform method of my MinMaxScaler on the cluster centers, it was still assuming the old feature order (from before the binarization step changed it), and the inverted values ended up under the wrong columns.
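One way to avoid this is to put the center columns back into the order the scaler was fitted on before calling inverse_transform. A minimal sketch under assumptions: the column names, the toy values, and the drop-and-re-add step standing in for the binarization are all hypothetical.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

# Hypothetical original frame.
df = pd.DataFrame({
    "height": [150.0, 160.0, 190.0, 200.0],
    "flag":   [0.0, 0.0, 1.0, 1.0],
    "weight": [50.0, 55.0, 90.0, 100.0],
})

scaler = MinMaxScaler().fit(df)
scaled = pd.DataFrame(scaler.transform(df), columns=df.columns)

# Stand-in for the binarization step: the column is dropped and re-added,
# so it ends up last and the feature order changes.
flag = scaled.pop("flag")
clustered = scaled.assign(flag=flag)  # order is now: height, weight, flag

k_means = KMeans(n_clusters=2, n_init=10, random_state=0).fit(clustered)

# Label the centers with the order used for clustering, then restore the
# order the scaler was fitted on before inverting the scaling.
centers = pd.DataFrame(k_means.cluster_centers_, columns=clustered.columns)
centers = centers[df.columns]
centers_original = pd.DataFrame(
    scaler.inverse_transform(centers.to_numpy()), columns=df.columns
)
print(centers_original)
```

Selecting columns by name (centers[df.columns]) rather than by position is the design choice here: it keeps the mapping correct no matter how the intermediate steps shuffled the columns.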
