Reputation: 587
I have been given the following task:
You want to reduce the amount of field sensors to 20. You should now have from the previous question, an array with all your loading vectors (pca.components_), one vector per principal component, with 137 elements (one per each sensor). Use clustering to group sensors that behave the same.
My data: consists of 137 different sensors and 8784 lines.
After I standardized my data, I see that 16 columns have a standard deviation of 0, so I remove them (this would mean they measure the same value every time, right?)
I run the following code:
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Do your PCA here.
pca = PCA(n_components=120)
pca.fit(data['std'])
X_pca = pca.transform(data['std'])

# Apply your clustering here
km = KMeans(n_clusters=20, init='k-means++', n_init=10, verbose=0)
km.fit(X_pca)
cluster_pred = km.predict(X_pca)

plt.figure(figsize=(10, 5))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=cluster_pred, s=20, cmap='viridis')
plt.show()
Now I end up with all the rows being clustered. How do I change this to cluster each column instead, so I can select one sensor from each cluster?
And for the selection, should I just take the center of each cluster?
Upvotes: 1
Views: 1745
Reputation: 563
I'm not sure what data['std'] looks like, so I was unable to run your code. Anyway, following what you describe, your problem may be solved by transposing your data, so that each sensor becomes a row to be clustered:
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import numpy as np

# Transpose so that each row is a sensor and each column is a time step.
transposed_data = np.transpose(data['std'])
# In case it doesn't work, try np.transpose(np.asarray(data['std']))

# Do your PCA here.
pca = PCA(n_components=120)
pca.fit(transposed_data)
X_pca = pca.transform(transposed_data)

# Apply your clustering here
km = KMeans(n_clusters=20, init='k-means++', n_init=10, verbose=0)
km.fit(X_pca)
cluster_pred = km.predict(X_pca)

plt.figure(figsize=(10, 5))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=cluster_pred, s=20, cmap='viridis')
plt.show()
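As for your selection question: a cluster center is not itself a real sensor, so a common choice is to keep, for each cluster, the sensor whose point lies closest to that cluster's centroid. Here is a minimal sketch under that assumption; it also assumes data['std'] is a pandas DataFrame whose column labels are the sensor names, in the same order as the rows of transposed_data (adjust sensor_names to however your sensors are actually labelled):

import numpy as np

# Assumed: column labels of data['std'] are the sensor names, in the same
# order as the rows of transposed_data.
sensor_names = list(data['std'].columns)

selected_sensors = []
for k in range(km.n_clusters):
    # Indices of the sensors assigned to cluster k.
    members = np.where(cluster_pred == k)[0]
    # Euclidean distance of each member to the cluster centroid in PCA space.
    dists = np.linalg.norm(X_pca[members] - km.cluster_centers_[k], axis=1)
    # Keep the sensor closest to the centroid as the cluster representative.
    selected_sensors.append(sensor_names[members[np.argmin(dists)]])

print(selected_sensors)  # 20 sensors, one per cluster

This gives you 20 real sensors rather than synthetic center points, which is usually what you want when physically reducing the number of field sensors.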
Upvotes: 1