Clustering with mixed data type

Question

Currently my data frame consists of both numerical and categorical values (mixed data types). It looks like this:

id   age   txn_duration   Statename     amount   gender   religion
1    27    275            bihar         110      m        hindu
2    33    163            maharashtra   50       f        muslim
3    53    63             delhi         50       f        muslim
4    47    100            up            50       m        hindu
5    39    263            punjab        100      m        punjabi
6    41    303            delhi         50       m        punjabi

There are 20 states (Statename) and 7 religions. I used get_dummies on both Statename and religion, but the result is very noisy. I also ran outlier detection. My questions are:

1. How do I find the optimal number of clusters for mixed data types?
2. I am currently using the k-means algorithm, but I am not getting good results. Can I use k-modes or any other method that would improve them?
3. How do I interpret my cluster results? So far I have used:

print (cluster_data[clmns].groupby(['clusters']).mean())

Is there any other way to inspect or plot the clusters? Please provide the code.

My code is:

# Importing libraries
import io
import itertools
import os
import warnings

import matplotlib.pyplot as plt  # visualization
import numpy as np
import pandas as pd
import seaborn as sns  # visualization
from kmodes.kprototypes import KPrototypes
from PIL import Image
from scipy import stats
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

%matplotlib inline
warnings.filterwarnings("ignore")

cluster_data = pd.read_csv("cluster.csv")

# One-hot encode the categorical columns; dtype=float keeps the dummies numeric for zscore
cluster_data = pd.get_dummies(cluster_data, columns=['StateName', 'gender', 'religion'], dtype=float)

clmns = ['mobile', 'age', 'txn_duration', 'amount', 'StateName_Bihar',
       'StateName_Delhi', 'StateName_Gujarat', 'StateName_Karnataka',
       'StateName_Maharashtra', 'StateName_Punjab', 'StateName_Rajasthan',
       'StateName_Telangana', 'StateName_Uttar Pradesh',
       'StateName_West Bengal', 'gender_female',
       'gender_male', 'religion_buddhist',
       'religion_christian', 'religion_hindu',
       'religion_jain', 'religion_muslim',
       'religion_other', 'religion_sikh']
df_tr_std = stats.zscore(cluster_data[clmns])

# Cluster the data
kmeans = KMeans(n_clusters=3, random_state=0).fit(df_tr_std)
labels = kmeans.labels_

# Glue the labels back onto the original data
cluster_data['clusters'] = labels

clmns.append('clusters')

# Analyze the clusters via per-cluster means
print(cluster_data[clmns].groupby(['clusters']).mean())

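For question 3, a mean over one-hot columns is hard to read. A minimal sketch of an alternative, assuming the labels are attached to the un-encoded data frame: profile numeric columns by their mean and categorical columns by their most frequent value, and use a normalized crosstab to see category shares per cluster. The rows are the six sample rows from the question; the cluster labels are hypothetical.

```python
import pandas as pd

# The six sample rows from the question, with hypothetical cluster labels
df = pd.DataFrame({
    "age": [27, 33, 53, 47, 39, 41],
    "txn_duration": [275, 163, 63, 100, 263, 303],
    "Statename": ["bihar", "maharashtra", "delhi", "up", "punjab", "delhi"],
    "amount": [110, 50, 50, 50, 100, 50],
    "gender": ["m", "f", "f", "m", "m", "m"],
    "religion": ["hindu", "muslim", "muslim", "hindu", "punjabi", "punjabi"],
    "clusters": [0, 1, 1, 1, 0, 0],  # assumed labels, for illustration only
})

num_cols = ["age", "txn_duration", "amount"]
cat_cols = ["Statename", "gender", "religion"]

# Numeric profile: mean of each numeric column per cluster
num_profile = df.groupby("clusters")[num_cols].mean()

# Categorical profile: most frequent value per cluster (ties resolved alphabetically)
cat_mode = df.groupby("clusters")[cat_cols].agg(lambda s: s.mode().iloc[0])

# Share of each gender within each cluster (each row sums to 1)
gender_share = pd.crosstab(df["clusters"], df["gender"], normalize="index")

print(num_profile)
print(cat_mode)
print(gender_share)
```

For a visual check, gender_share.plot(kind="bar", stacked=True) draws the shares as stacked bars, and sns.boxplot(x="clusters", y="age", data=df) shows how a numeric column is distributed in each cluster.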