user3qpu
user3qpu

Reputation: 67

Feature agglomeration: How to retrieve the features that make up the clusters?

I am using the scikit-learn's feature agglomeration to use a hierarchical clustering procedure on features rather than on the observations.

This is my code:

from sklearn import cluster
import pandas as pd

#load the data
df = pd.read_csv('C:/Documents/data.csv')
agglo = cluster.FeatureAgglomeration(n_clusters=5)
agglo.fit(df)
df_reduced = agglo.transform(df)

My original df had the shape (990, 15), after using feature agglomeration, df_reduced now has (990, 5).

How do now find out how the original 15 features have been clustered together? In other words, what original features from df make up each of the 5 new features in df_reduced?

Upvotes: 2

Views: 3226

Answers (2)

Jojo
Jojo

Reputation: 477

The way how the features within each of the clusters are combined during transform is set by the way you perform the hierarchical clustering. The reduced feature set simply consists of the n_clusters cluster centers (which are n_samples - dimensional vectors). For certain applications you might think of computing centers manually using different definitions of cluster centers (i.e. median instead of mean to avoid the influence of outliers etc.).

n_features = 15
feature_identifier = range(n_features)
feature_groups = [np.array(feature_identifier )[agglo.labels_==i] for i in range(n_clusters)]
new_features = [df.loc[:,df.keys()[group]].mean(0) for group in feature_groups]

Don't forget to standardize the features beforehand (for example using sklearn's scaler). Otherwise you are rather grouping the scales of the quantities than clustering similar behavior. Hope that helps! Haven't tested the code. Let me know if there are problems.

Upvotes: 2

σηγ
σηγ

Reputation: 1324

After fitting the clusterer, agglo.labels_ contains a list that tells in which cluster in the reduced dataset each feature in the original dataset belongs.

Upvotes: 1

Related Questions