Srishti Aggarwal

Reputation: 1

Clustering with mixed data type

Currently my data frame consists of both numerical and categorical values (mixed data types). It looks like this:

id       age      txn_duration        Statename        amount      gender     religion 
1         27        275                bihar            110          m         hindu
2         33        163               maharashtra       50           f         muslim
3         53         63               delhi             50           f         muslim
4         47        100               up                50           m         hindu
5         39        263               punjab            100          m         punjabi
6         41        303               delhi             50           m         punjabi

There are 20 states (Statename) and 7 religions. I applied get_dummies to both Statename and religion, but the result is very noisy. I also detected outliers. My questions are:

1. How do I find the optimum number of clusters for mixed data types?
2. I am currently using the k-means algorithm and not getting good results. Can I use k-modes or any other method that would improve them?
3. How do I interpret my cluster results? So far I have used

print (cluster_data[clmns].groupby(['clusters']).mean())

Is there any other way I can view or plot the clusters? Please provide the code.

My code is -

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import numpy as np
# Importing libraries
import os
import matplotlib.pyplot as plt  # visualization
from PIL import Image
%matplotlib inline
import seaborn as sns  # visualization
import itertools
import warnings
warnings.filterwarnings("ignore")
import io
from scipy import stats
from sklearn.cluster import KMeans
from kmodes.kprototypes import KPrototypes

cluster_data = pd.read_csv("cluster.csv")

cluster_data = pd.get_dummies(cluster_data, columns=['StateName'])
cluster_data = pd.get_dummies(cluster_data, columns=['gender'])
cluster_data = pd.get_dummies(cluster_data, columns=['religion'])

clmns = ['mobile', 'age', 'txn_duration', 'amount', 'StateName_Bihar',
       'StateName_Delhi', 'StateName_Gujarat', 'StateName_Karnataka',
       'StateName_Maharashtra', 'StateName_Punjab', 'StateName_Rajasthan',
       'StateName_Telangana', 'StateName_Uttar Pradesh',
       'StateName_West Bengal', 'gender_female',
       'gender_male', 'religion_buddhist',
       'religion_christian', 'religion_hindu',
       'religion_jain', 'religion_muslim',
       'religion_other', 'religion_sikh']
df_tr_std = stats.zscore(cluster_data[clmns])

# Cluster the data
kmeans = KMeans(n_clusters=3, random_state=0).fit(df_tr_std)
labels = kmeans.labels_

# Glue the labels back onto the original data
cluster_data['clusters'] = labels

clmns.extend(['clusters'])

# Let's analyze the clusters
print(cluster_data[clmns].groupby(['clusters']).mean())

Upvotes: 0

Views: 1919

Answers (1)

user2677285

Reputation: 313

You can run something like the code below. In the attached plot, you can see that going beyond 3 clusters (for the dataset it was run on) does not give a significant decrease in distortion, so the optimum number of clusters in that case (simple synthetic data) would be 3. For noisy data the decision might be harder.

Reference: A. Mueller's scipy notes on sklearn

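import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# X is your numeric feature matrix (for example, df_numerics below)
distortions = []
for i in range(1, 11):
    km = KMeans(n_clusters=i,
                random_state=0)
    km.fit(X)
    # inertia_ is the within-cluster sum of squared distances
    distortions.append(km.inertia_)

plt.plot(range(1, 11), distortions, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Distortion')
plt.show()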
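If reading the elbow off the plot by eye feels too subjective, one rough way to make the "no significant decrease" rule explicit is sketched below. The helper name pick_elbow, the 10% threshold, and the distortion values are all assumptions made up for illustration; they are not part of sklearn and not results from your data, and the threshold will need tuning on noisy data.

```python
def pick_elbow(distortions, min_relative_drop=0.10):
    """Return the last k whose successor no longer gives a worthwhile drop.

    distortions[i] is assumed to be the inertia for k = i + 1 clusters,
    which is how the list in the elbow loop above is filled.
    """
    for i in range(1, len(distortions)):
        prev, cur = distortions[i - 1], distortions[i]
        if prev <= 0:
            return i  # distortion is already zero at k = i
        relative_drop = (prev - cur) / prev
        if relative_drop < min_relative_drop:
            return i  # going from k = i to k = i + 1 barely helps
    # Every extra cluster still helped; the tested range may be too small.
    return len(distortions)


# Made-up distortion values for k = 1..6, only to show the call.
example = [1000.0, 400.0, 120.0, 110.0, 102.0, 96.0]
print(pick_elbow(example))
```

This is only a heuristic on top of the plot, so it is worth checking its answer against the curve before trusting it.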

Edit for ValueError: k-means needs numerical columns only, so you can drop the categorical ones like this:

df_numerics = df.drop(['Statename', 'gender', 'religion'], axis=1)

You can also drop other columns that you don't want included in clustering analysis.

With df_numerics, try the elbow method to find a good number of clusters.

Then, say you found that 3 clusters works well. You can run:

from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3, random_state=42)
labels = kmeans.fit_predict(X)

labels contains the cluster number (0, 1, 2 for 3 clusters) for each row in your dataframe. You can also save it as a column in your dataframe:

df['cluster_labels'] = labels
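Once the labels are stored as a column, one possible way to interpret the clusters is to build a per-cluster profile: means for the numeric columns and the most frequent value for the categorical ones. The sketch below uses the six sample rows from the question; the cluster labels in it are invented for illustration and are not the output of a real clustering run, and most_common is a hypothetical helper, not a pandas function.

```python
import pandas as pd

# Sample rows from the question; cluster_labels are made up for illustration.
df = pd.DataFrame({
    "age": [27, 33, 53, 47, 39, 41],
    "txn_duration": [275, 163, 63, 100, 263, 303],
    "amount": [110, 50, 50, 50, 100, 50],
    "gender": ["m", "f", "f", "m", "m", "m"],
    "religion": ["hindu", "muslim", "muslim", "hindu", "punjabi", "punjabi"],
    "cluster_labels": [0, 1, 1, 1, 0, 0],
})


def most_common(series):
    """Return the most frequent value (ties resolved by sort order)."""
    return series.mode().iloc[0]


# One row per cluster: size, numeric means, dominant categories.
profile = df.groupby("cluster_labels").agg(
    size=("age", "size"),
    mean_age=("age", "mean"),
    mean_txn_duration=("txn_duration", "mean"),
    mean_amount=("amount", "mean"),
    top_gender=("gender", most_common),
    top_religion=("religion", most_common),
)
print(profile)
```

Because the categorical columns stay in df even when they are dropped for clustering, this kind of table can show whether a cluster is dominated by a particular state, gender or religion.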

Then, to visualize it, you can pick 2 columns (more than that is difficult to visualize). Say you picked 'txn_duration' and 'amount': you can plot those columns and use the cluster labels as the color, like this:

import matplotlib.pyplot as plt
plt.scatter(df['txn_duration'], df['amount'], c=df['cluster_labels'])
plt.show()

[Image: elbow method plot]

Upvotes: 0
