Bok

Reputation: 587

K-means: cluster columns instead of rows

I have been given the following task:
You want to reduce the amount of field sensors to 20. You should now have from the previous question, an array with all your loading vectors (pca.components_), one vector per principal component, with 137 elements (one per each sensor). Use clustering to group sensors that behave the same.

My data consists of 137 different sensors (columns) and 8784 rows.

After I standardized my data, I saw that 16 columns have a standard deviation of 0, so I removed them. (This would mean they measure the same value every time, right?)
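
A minimal sketch of that filtering step, using a small made-up DataFrame rather than my real data. The names `readings`, `sensor_a`, `sensor_b` and `sensor_c` are hypothetical, and here the constant column is dropped before standardizing to avoid dividing by zero:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the sensor table: 3 sensors, one of them constant.
rng = np.random.default_rng(0)
readings = pd.DataFrame({
    "sensor_a": rng.normal(size=100),
    "sensor_b": np.full(100, 5.0),  # constant sensor, standard deviation is 0
    "sensor_c": rng.normal(size=100),
})

# Keep only the columns whose standard deviation is non-zero.
stds = readings.std()
varying = readings.loc[:, stds > 0]

# Standardize the remaining columns (zero mean, unit variance).
standardized = (varying - varying.mean()) / varying.std()
kept_columns = list(standardized.columns)
```

A column with zero standard deviation carries no information for PCA or clustering, which is why it is dropped here.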

I run the following code:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Do your PCA here.
pca = PCA(n_components=120)
pca.fit(data['std'])

# Project every row (time step) onto the principal components.
X_pca = pca.transform(data['std'])

# Apply your clustering here
km = KMeans(n_clusters=20, init='k-means++', n_init=10, verbose=0)
km.fit(X_pca)
cluster_pred = km.predict(X_pca)

# Plot the first two principal components, colored by cluster.
plt.figure(figsize=(10, 5))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=cluster_pred, s=20, cmap='viridis')
plt.show()

Now I end up with all the rows being clustered. How do I change this to cluster the columns instead, so that I can select one sensor from each cluster?
And for the selection, should I just take the center of each cluster?

Upvotes: 1

Views: 1745

Answers (1)

caspillaga

Reputation: 563

I'm not sure what data['std'] looks like, so I was unable to run your code. Going by your description, though, your problem may be solved by transposing your data, so that each sensor becomes a row (a sample to be clustered):

import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# After transposing, each row is one sensor and each column is one time step.
transposed_data = np.transpose(data['std'])
# In case it doesn't work, try with np.transpose(np.asarray(data['std']))

# Do your PCA here.
pca = PCA(n_components=120)
pca.fit(transposed_data)

X_pca = pca.transform(transposed_data)

# Apply your clustering here
km = KMeans(n_clusters=20, init='k-means++', n_init=10, verbose=0)
km.fit(X_pca)
cluster_pred = km.predict(X_pca)  # one cluster label per sensor

# Plot the first two principal components, colored by cluster.
plt.figure(figsize=(10, 5))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=cluster_pred, s=20, cmap='viridis')
plt.show()
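
On the selection question: a cluster center is an average point, not an actual sensor, so one option is to pick the sensor closest to each center. The task text also suggests clustering the loading vectors (pca.components_) directly. Below is a rough sketch of both ideas together, not a definitive solution. The random array `X`, the reduced sizes (30 sensors, 5 clusters, 10 components) and the names `loadings` and `representatives` are assumptions for illustration only:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Hypothetical stand-in for the standardized data: 500 rows, 30 sensors.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 30))

n_clusters = 5  # would be 20 in the question

pca = PCA(n_components=10)
pca.fit(X)

# pca.components_ has shape (n_components, n_sensors).
# Transposing gives one loading vector (row) per sensor.
loadings = pca.components_.T

km = KMeans(n_clusters=n_clusters, init='k-means++', n_init=10, random_state=0)
sensor_labels = km.fit_predict(loadings)

# km.transform returns the distance of every sensor to every cluster center.
distances = km.transform(loadings)  # shape (n_sensors, n_clusters)

# For each cluster, keep the member sensor closest to the cluster center.
representatives = []
for k in range(n_clusters):
    members = np.flatnonzero(sensor_labels == k)
    if members.size == 0:
        continue  # guard against an empty cluster
    best = members[np.argmin(distances[members, k])]
    representatives.append(int(best))
```

`representatives` then holds one column index per cluster, which can be used to select the sensors to keep from the original data.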

Upvotes: 1
