Reputation: 1043
I am trying to cluster a data frame that was given to me. It has 14 columns. How can I run the clustering on 8 of them?
Below is the code that I found and followed.
[Figure: elbow method plot]
[Figure: cluster visualization]
# K-Means Clustering
# importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# importing the customer expense/invoice dataset with pandas
dataset=pd.read_csv('Expense_Invoice.csv')
X=dataset.iloc[: , [3,2]].values
# Using the elbow method to find the optimal number of clusters
from sklearn.cluster import KMeans
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)
plt.plot(range(1, 11),wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters K')
plt.ylabel('Within-cluster sum of squares (WCSS)')
plt.show()
# Applying k-means to the Expense_Invoice dataset
kmeans=KMeans(n_clusters=3, init='k-means++', max_iter= 300, n_init= 10, random_state= 0)
y_kmeans=kmeans.fit_predict(X)
# Visualizing the clusters
plt.scatter(X[y_kmeans == 0, 0], X[y_kmeans == 0, 1], s = 100, c = 'red', label='Careful(c1)')
plt.scatter(X[y_kmeans == 2, 0], X[y_kmeans == 2, 1], s = 100, c = 'green', label='Standard(c2)')
plt.scatter(X[y_kmeans == 1, 0], X[y_kmeans == 1, 1], s = 100, c = 'blue', label='Target(c3)')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s = 250, c = 'yellow',
            label='Centroids')
plt.title('Clusters of customer Invoices & Expenses')
plt.xlabel('Total Invoices ')
plt.ylabel('Total Expenses')
plt.legend()
plt.show()
This works perfectly, but only for two columns (variables). I want to do the same for 8 columns, and I don't understand how.
Upvotes: 3
Views: 8669
Reputation: 96
With X=dataset.iloc[: , [3,2]].values
you are specifically selecting the 4th and 3rd columns (positions 3 and 2, since iloc is zero-based).
KMeans performs the clustering on all columns you selected.
Therefore, you need to change X=dataset.iloc[: , [3,2]].values
so that it selects the columns you want. For example, to use the first 8 columns of your dataset: X=dataset.iloc[:, 0:8].values
Take a look at the pandas documentation for more ways to select data from a data frame: https://pandas.pydata.org/pandas-docs/stable/indexing.html
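As a rough sketch of the selection options, here is how the 8-column case might look. The real Expense_Invoice.csv is not available here, so the data frame below is a synthetic stand-in with 14 numeric columns, and the names col0 to col13 are made up for illustration. The particular columns picked in options 2 and 3 are arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical stand-in for Expense_Invoice.csv: 60 rows, 14 numeric columns.
rng = np.random.default_rng(0)
dataset = pd.DataFrame(
    rng.normal(size=(60, 14)),
    columns=[f"col{i}" for i in range(14)],
)

# Option 1: the first 8 columns, by position.
X = dataset.iloc[:, 0:8].values

# Option 2: any 8 columns by position (they do not have to be contiguous).
X_picked = dataset.iloc[:, [0, 2, 3, 5, 7, 8, 10, 13]].values

# Option 3: the same 8 columns as option 2, by name.
names = ["col0", "col2", "col3", "col5", "col7", "col8", "col10", "col13"]
X_named = dataset[names].values

# KMeans clusters on every column of X, so nothing else has to change.
kmeans = KMeans(n_clusters=3, init="k-means++", max_iter=300, n_init=10, random_state=0)
y_kmeans = kmeans.fit_predict(X)

print(X.shape)                        # rows x selected columns
print(kmeans.cluster_centers_.shape)  # one centroid per cluster, one coordinate per column
```

If the 8 columns are on very different scales, it may be worth standardizing them first (for example with sklearn.preprocessing.StandardScaler), since k-means relies on Euclidean distance.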
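Keep in mind that with 8 columns you can no longer plot the clusters directly in a 2D scatter plot as you did before. You would first have to reduce the data to two dimensions (for example with PCA) or plot two columns at a time.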
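One possible workaround, which is not part of the original code, is to cluster on all 8 columns and then project the points and centroids onto two principal components just for plotting. The sketch below assumes synthetic data in place of the real file, and plot_clusters is a hypothetical helper name.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Synthetic stand-in for the 8 selected columns (60 rows).
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 8))

# Cluster in the full 8-dimensional space.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
y_kmeans = kmeans.fit_predict(X)

# Project points and centroids to 2D for display only.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
centers_2d = pca.transform(kmeans.cluster_centers_)


def plot_clusters(points_2d, labels, centers):
    """Hypothetical helper: scatter plot of the projected clusters."""
    import matplotlib.pyplot as plt

    for k in np.unique(labels):
        mask = labels == k
        plt.scatter(points_2d[mask, 0], points_2d[mask, 1], s=100, label=f"Cluster {k}")
    plt.scatter(centers[:, 0], centers[:, 1], s=250, c="yellow", label="Centroids")
    plt.xlabel("Principal component 1")
    plt.ylabel("Principal component 2")
    plt.legend()
    plt.show()
```

Calling plot_clusters(X_2d, y_kmeans, centers_2d) draws the figure. The axes are principal components rather than original columns, so clusters that are separate in 8 dimensions may still overlap in the projection.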
Upvotes: 2