Reputation: 143
I am using KMeans from sklearn to cluster College.csv, but after I fit the KMeans model my dataset changes! Before using KMeans, I standardize the numerical variables with StandardScaler and use OneHotEncoder to dummy-encode the categorical variable "Private".
My code is:
num_vars = data.columns[1:]
scaler = StandardScaler()
data[num_vars] = scaler.fit_transform(data[num_vars])
ohe = OneHotEncoder()
data["Private"] = ohe.fit_transform(data.Private.values.reshape(-1,1)).toarray()
km = KMeans(n_clusters = 6)
km.fit(data)
The dataset before using the KMeans:
The dataset after using the KMeans:
Upvotes: 2
Views: 119
Reputation: 19545
It appears that when you run km.fit(data), the .fit method modifies data in place by inserting a column that is the opposite of your one-hot encoded column. Also confusing is the fact that the "Terminal" column disappears.
For now, you can use this workaround that copies your data:
data1 = data.copy()
km = KMeans(n_clusters = 6, n_init = 'auto')
km.fit(data1)
Edit: When you run km.fit, the first method that runs is km._validate_data, a validation step that modifies the dataframe you pass (see here and here).
For example, if I add the following to the end of your code:
km._validate_data(
data,
accept_sparse="csr",
dtype=[np.float64, np.float32],
order="C",
accept_large_sparse=False,
)
Running this changes your data, but I don't know exactly why this is happening. It may have to do with something about the data itself.
Upvotes: 3
Reputation: 4273
There's a subtle bug in the posted code. Let's demonstrate it:
new_df = pd.DataFrame({"Private": ["Yes", "Yes", "No"]})
OneHotEncoder returns something like this:
new_data = np.array(
[[0, 1],
[0, 1],
[1, 0]])
What happens if we assign our new (3, 2) array to new_df["Private"]?
>>> new_df["Private"] = new_data
>>> print(new_df)
Private
0 0
1 0
2 1
Wait, where'd the other column go?
Uh oh, it's still in there ...
... but it's invisible until we look at the actual values:
>>> print(new_df.values)
[[0 1]
[0 1]
[1 0]]
As @Derek hinted in his answer, KMeans has to validate the data, which usually converts pandas dataframes into the underlying arrays. When this happens, all your "columns" get shifted to the right by one because of the invisible column created by the OneHotEncoder.
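If one-hot encoding is still wanted outside a pipeline, the shape mismatch described above can be avoided by making sure only a single column is assigned back. The following is a minimal sketch, not the asker's actual data: the toy frame and its values are made up for illustration, and drop="if_binary" is one possible way (assuming scikit-learn 0.23 or later) to get a single 0/1 column for a two-level feature.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy frame standing in for College.csv (values are made up for illustration)
toy = pd.DataFrame({"Private": ["Yes", "Yes", "No"], "Apps": [1660, 2186, 1428]})

# Default OneHotEncoder: one output column per category, so shape is (3, 2)
full = OneHotEncoder().fit_transform(toy[["Private"]]).toarray()

# drop="if_binary" keeps a single indicator column for a two-level feature,
# so shape is (3, 1); the first category ("No") is the one dropped
single = OneHotEncoder(drop="if_binary").fit_transform(toy[["Private"]]).toarray()

# Assign a 1-D array so the frame keeps exactly one "Private" column
toy["Private"] = single.ravel()

print(full.shape, single.shape, toy.shape)
```

Checking that toy.shape and toy.to_numpy().shape agree before calling KMeans is a cheap way to catch a hidden extra column.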
Is there a better way?
Yep, use a pipeline!
pipe = make_pipeline(
ColumnTransformer(
transformers=[
("ohe", OrdinalEncoder(categories=[["No", "Yes"]]), ["Private"]),
],
remainder=StandardScaler(),
),
KMeans(n_clusters=6),
)
out = pipe.fit(df)
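For anyone who wants to try the pipeline without the original CSV, here is a self-contained sketch with imports. The random frame, its column values, the step name "private", and the choice of 3 clusters are assumptions made purely for illustration; only the structure of the pipeline follows the snippet above.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Synthetic stand-in for College.csv (made-up values, not the real data)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Private": rng.choice(["No", "Yes"], size=30),
    "Apps": rng.integers(100, 5000, size=30),
    "Accept": rng.integers(50, 3000, size=30),
})
before = df.copy()  # kept only to verify that fitting leaves df untouched

pipe = make_pipeline(
    ColumnTransformer(
        # Encode "Private" as 0/1; every other column is standardized
        transformers=[("private", OrdinalEncoder(categories=[["No", "Yes"]]), ["Private"])],
        remainder=StandardScaler(),
    ),
    KMeans(n_clusters=3, n_init=10, random_state=0),
)

# Fit the whole pipeline and get one cluster label per row
labels = pipe.fit_predict(df)

print(df.equals(before), labels.shape)
```

Because all preprocessing happens inside the pipeline, nothing is assigned back into df, so the original frame stays as it was loaded.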
Upvotes: 2
Reputation: 81
The data is the same but shifted over by one column. The Apps column never existed before, and everything is shifted to the right. It has something to do with your line data[num_vars] = scaler.fit_transform(data[num_vars]), which is effectively a nested index, data[data.columns[1:]].
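Basically, you can follow a method like this (the positional slicing as written assumes data is a NumPy array; a DataFrame would need data.iloc[:, 1:] instead):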
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
data[:, 1:] = sc.fit_transform(data[:, 1:])
Upvotes: 0