Reputation: 1
Currently my data frame consists of both numerical and categorical values (mixed data types). My data frame looks like -
id  age  txn_duration  Statename    amount  gender  religion
1   27   275           bihar        110     m       hindu
2   33   163           maharashtra  50      f       muslim
3   53   63            delhi        50      f       muslim
4   47   100           up           50      m       hindu
5   39   263           punjab       100     m       punjabi
6   41   303           delhi        50      m       punjabi
There are 20 states (Statename) and 7 religions. I have done get_dummies for both Statename and religion but got a lot of noise. I have also detected outliers. My questions are:
1. How do I find the optimum number of clusters for mixed data types?
2. I am currently using the k-means algorithm. Can I use k-modes or any other method that will improve my results? I am not getting good results with k-means.
3. How do I interpret my cluster results? I have used
print (cluster_data[clmns].groupby(['clusters']).mean())
Is there any other way I can see or plot this? Please provide the code.
My code is -
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import numpy as np
#Importing libraries
import os
import matplotlib.pyplot as plt  # visualization
from PIL import Image
%matplotlib inline
import seaborn as sns  # visualization
import itertools
import warnings
warnings.filterwarnings("ignore")
import io
from scipy import stats
from sklearn.cluster import KMeans
from kmodes.kprototypes import KPrototypes
cluster_data = pd.read_csv("cluster.csv")
cluster_data = pd.get_dummies(cluster_data, columns=['StateName'])
cluster_data = pd.get_dummies(cluster_data, columns=['gender'])
cluster_data = pd.get_dummies(cluster_data, columns=['religion'])
clmns = ['mobile', 'age', 'txn_duration', 'amount', 'StateName_Bihar',
'StateName_Delhi', 'StateName_Gujarat', 'StateName_Karnataka',
'StateName_Maharashtra', 'StateName_Punjab', 'StateName_Rajasthan',
'StateName_Telangana', 'StateName_Uttar Pradesh',
'StateName_West Bengal', 'gender_female',
'gender_male', 'religion_buddhist',
'religion_christian', 'religion_hindu',
'religion_jain', 'religion_muslim',
'religion_other', 'religion_sikh']
df_tr_std = stats.zscore(cluster_data[clmns])
#Cluster the data
kmeans = KMeans(n_clusters=3, random_state=0).fit(df_tr_std)
labels = kmeans.labels_
#Glue the labels back onto the original data
cluster_data['clusters'] = labels
clmns.extend(['clusters'])
#Let's analyze the clusters
print (cluster_data[clmns].groupby(['clusters']).mean())
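Note: the KPrototypes import above is never actually used. A rough sketch of how k-prototypes could be run on the raw (un-dummied) columns, assuming the kmodes package and the column layout from the sample data shown above, would look like this:
import pandas as pd
from kmodes.kprototypes import KPrototypes

raw = pd.read_csv("cluster.csv")
# mixed matrix: numeric and categorical columns side by side
X_mixed = raw[['age', 'txn_duration', 'amount', 'Statename', 'gender', 'religion']].values
kproto = KPrototypes(n_clusters=3, init='Cao', random_state=0)
# indices 3, 4, 5 are the categorical columns (Statename, gender, religion)
labels = kproto.fit_predict(X_mixed, categorical=[3, 4, 5])
raw['clusters'] = labels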
Upvotes: 0
Views: 1919
Reputation: 313
You can run something like the code below. Look at the image attached: in that plot you can see that having more than 3 clusters (for the dataset it was run on) does not provide a significant decrease in distortion, so the optimum cluster number would be 3 in that case (simple synthetic data). For noisy data the decision might be harder.
Reference: A. Mueller's scipy notes on sklearn
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# X is your numeric feature matrix (e.g. df_numerics from below)
distortions = []
for i in range(1, 11):
    km = KMeans(n_clusters=i, random_state=0)
    km.fit(X)
    distortions.append(km.inertia_)

plt.plot(range(1, 11), distortions, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Distortion')
plt.show()
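If the elbow is not clear-cut (for example with noisy data), another common check, not part of the original answer, is the silhouette score. A minimal sketch, assuming X is the same numeric feature matrix used above:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# silhouette ranges from -1 to 1; higher means better separated clusters
for k in range(2, 11):
    km = KMeans(n_clusters=k, random_state=0)
    cluster_labels = km.fit_predict(X)
    print(k, silhouette_score(X, cluster_labels))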
Edit for ValueError: you need just the numerical columns, so you can do something like this:
df_numerics = df.drop(['Statename', 'gender', 'religion'], axis=1)
You can also drop any other columns that you don't want included in the clustering analysis.
With df_numerics, try the elbow method above to find a good cluster number.
Then, let's say you found that 3 clusters works well, you can run:
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3, random_state=42)
labels = kmeans.fit_predict(X)  # X: the numeric feature matrix, e.g. df_numerics
labels contains the cluster number (0, 1, 2 for 3 clusters) for each row in your dataframe. You can also save this as a column in your dataframe:
df['cluster_labels'] = labels
Then to visualize it you can pick 2 columns (more than that is difficult to visualize). Let's say you picked 'txn_duration' and 'amount'; you can plot those columns and add the cluster labels as the color, like this:
import matplotlib.pyplot as plt
plt.scatter(df['txn_duration'],df['amount'], c=df['cluster_labels'])
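Beyond a single scatter plot, a per-cluster summary and a box plot can help with interpretation (seaborn is already imported in the question). A minimal sketch, assuming the cluster_labels column created above and the original column names:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# mean of each numeric column per cluster (same idea as the groupby in the question)
print(df.groupby('cluster_labels')[['age', 'txn_duration', 'amount']].mean())

# distribution of one numeric feature per cluster
sns.boxplot(x='cluster_labels', y='amount', data=df)
plt.show()

# how a categorical column is spread across the clusters
print(pd.crosstab(df['cluster_labels'], df['religion']))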
Upvotes: 0