Marcin Miśkowiec

Reputation: 3

'K-means' cluster analysis

I want to get statistics such as the mean, min, max, and standard deviation for each cluster found by the k-means method. Is the code below correct?

    import pandas as pd
    from sklearn.cluster import KMeans

    dataset = pd.read_csv("C:/Users/../cardio_train_py.csv", sep=';')    
    clusterDB_1 = dataset[['Age','BMI','cardio']].copy()
    kmeans = KMeans(n_clusters=8).fit(clusterDB_1)
    
    X=[0,1,2,3,4,5,6,7]
    print('Age mean() for each cluster')
    for x in X:
        check = clusterDB_1[kmeans.labels_ == x]
        print(check['Age'].mean())
    print('BMI mean() for each cluster')
    for x in X:
        check = clusterDB_1[kmeans.labels_ == x]
        print(check['BMI'].mean())
    print('cardio == 0 count() for each cluster')
    for x in X:
        check = clusterDB_1[kmeans.labels_ == x]
        # count rows with cardio == 0, matching the label printed above
        print(len(check[check['cardio'] == 0]))

I'm asking because the values I get (e.g. the mean of Age and BMI, and the count of cardio == 0) are different from the values obtained in Statistica (the photo shows the Statistica results). Below are the BMI means for each cluster from the Python calculation:

24.468587736260996
24.047855933307282
30.548865468674116
31.98410463004993
32.89129084635681
166.57357142857146
41.97845737483085
24.16813400017246

Here is my dataset: https://www.easypaste.org/file/JcyGhA8Y/cardio.train.py.csv?lang=pl

Thanks for all help and tips :)

Upvotes: 0

Views: 93

Answers (1)

DYZ

Reputation: 57105

The following will do just what you want, in one line:

    clusterDB_1.groupby(kmeans.labels_).mean()
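The same groupby idea can probably be extended to the other statistics asked about (min, max, standard deviation) and to the cardio == 0 count. Below is a minimal sketch, not a definitive solution: the data is synthetic (the `df` frame and its values are made up for illustration, not taken from the linked CSV), and `n_clusters=3`, `n_init=10` and `random_state=0` are assumptions chosen only to keep the example small and repeatable.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-in for the real CSV (hypothetical values, not the asker's data).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Age": rng.integers(30, 65, size=200),
    "BMI": rng.normal(27, 4, size=200).round(1),
    "cardio": rng.integers(0, 2, size=200),
})

# Assumption: random_state and n_init are fixed so repeated runs give the same labels.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(df)

# Mean, min, max and standard deviation of Age and BMI for each cluster.
stats = df.groupby(kmeans.labels_)[["Age", "BMI"]].agg(["mean", "min", "max", "std"])
print(stats)

# Number of rows with cardio == 0 in each cluster.
cardio0 = (df["cardio"] == 0).groupby(kmeans.labels_).sum()
print(cardio0)
```

Two things to keep in mind when comparing against another program: without a fixed `random_state`, k-means starts from random centers, so cluster numbering (and sometimes the clusters themselves) can change between runs; and pandas `std` computes the sample standard deviation (`ddof=1`) by default.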

Upvotes: 1
