sargupta

Reputation: 1043

How can I do KMeans clustering in python for 8 columns in a data-frame of 14 columns?

I am trying to cluster the data-frame given to me. It has 14 columns. How do I run the clustering on only 8 of them?

Below is the code that I found and followed.

Elbow method:

[image: Elbow_method]

Visualization:

[image: Visualization]

# K-Means Clustering

# importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# importing the customer Expenses Invoices dataset with pandas
dataset=pd.read_csv('Expense_Invoice.csv')
X=dataset.iloc[: , [3,2]].values

# Using the elbow method to find  the optimal number of clusters
from sklearn.cluster import KMeans
wcss = []
for i in range(1, 11):
  kmeans=KMeans(n_clusters=i, init='k-means++', max_iter= 300, n_init= 10, random_state= 0)
  kmeans.fit(X)
  wcss.append(kmeans.inertia_)
plt.plot(range(1, 11),wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters K')
plt.ylabel('Within-Cluster Sum of Squares (WCSS)')
plt.show()

# Applying k-means to the dataset
kmeans=KMeans(n_clusters=3, init='k-means++', max_iter= 300, n_init= 10, random_state= 0)
y_kmeans=kmeans.fit_predict(X)

# Visualizing the clusters
plt.scatter(X[y_kmeans == 0, 0], X[y_kmeans == 0, 1], s = 100, c = 'red', label='Careful(c1)')
plt.scatter(X[y_kmeans == 2, 0], X[y_kmeans == 2, 1], s = 100, c = 'green', label='Standard(c2)')
plt.scatter(X[y_kmeans == 1, 0], X[y_kmeans == 1, 1], s = 100, c = 'blue', label='Target(c3)')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s = 250, c = 'yellow', 
            label='Centroids')
plt.title('Clusters of customer Invoices & Expenses')
plt.xlabel('Total Invoices ')
plt.ylabel('Total Expenses')
plt.legend()
plt.show()

This works perfectly, but only for two columns (variables). I want to do it for 8 columns, but I could not figure out how.

Upvotes: 3

Views: 8669

Answers (1)

Markus Hennerbichler

Reputation: 96

With X=dataset.iloc[: , [3,2]].values you are specifically selecting the 4th and 3rd columns. KMeans performs the clustering on all the columns you selected.
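To see what that positional selection does, here is a small sketch on a made-up four-column frame (the column names a to d are hypothetical, not from the question's CSV):

```python
import pandas as pd

# Tiny illustrative frame; the column names are made up for this example
df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6], "d": [7, 8]})

# iloc selects by zero-based position, so [3, 2] means the 4th column, then the 3rd
X = df.iloc[:, [3, 2]].values

print(df.columns[[3, 2]].tolist())  # ['d', 'c']
print(X.shape)                      # (2, 2)
```

The order of the list is preserved, so column 0 of X is the 4th column of the frame and column 1 is the 3rd.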

Therefore you need to change X=dataset.iloc[: , [3,2]].values to suit your needs. E.g., to use the first 8 columns of your dataset: X=dataset.iloc[:, 0:8].values.
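As a rough sketch of the whole flow, assuming all 14 columns are numeric: the frame below is random stand-in data rather than Expense_Invoice.csv, and the column names and chosen positions are hypothetical. The 8 columns do not have to be contiguous; any list of positions works.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-in for the real CSV: 200 rows, 14 numeric columns (hypothetical names)
rng = np.random.default_rng(0)
dataset = pd.DataFrame(rng.normal(size=(200, 14)),
                       columns=[f"col_{i}" for i in range(14)])

# Pick any 8 columns by zero-based position (this particular choice is arbitrary)
feature_positions = [0, 1, 2, 3, 5, 7, 9, 12]
X = dataset.iloc[:, feature_positions].values

# Same KMeans settings as in the question, now fitted on 8 features
kmeans = KMeans(n_clusters=3, init='k-means++', max_iter=300, n_init=10, random_state=0)
y_kmeans = kmeans.fit_predict(X)

print(X.shape)                        # (200, 8)
print(kmeans.cluster_centers_.shape)  # (3, 8)
```

Each centroid now has 8 coordinates, one per selected column. The elbow loop from the question works unchanged on this X.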

Take a look at the pandas documentation for more options on how to select data in DataFrames: https://pandas.pydata.org/pandas-docs/stable/indexing.html

Keep in mind that with 8 columns you can no longer visualize your clusters directly in a 2D scatter plot as you did before.
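One possible workaround, which is an addition and not part of the answer above, is to project the 8 features down to 2 with PCA purely for plotting. A minimal sketch on random stand-in data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Random stand-in for the 8 selected columns (not the real data)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))

# Cluster in the full 8-dimensional space
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

# Project points and centroids to 2 dimensions for display only
pca = PCA(n_components=2, random_state=0)
X_2d = pca.fit_transform(X)
centers_2d = pca.transform(kmeans.cluster_centers_)

print(X_2d.shape)        # (200, 2)
print(centers_2d.shape)  # (3, 2)
```

X_2d and centers_2d can then go into the same plt.scatter calls as in the question, e.g. plt.scatter(X_2d[labels == 0, 0], X_2d[labels == 0, 1], ...). The projection discards information, so clusters that are well separated in 8 dimensions may overlap in the plot.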

Upvotes: 2
