mandi

Reputation: 55

KMeans Clustering only using specific Csv column

I am following a tutorial to learn how to use KMeans.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import style
style.use("ggplot")
from sklearn.cluster import KMeans



X = np.array([[1, 2],
              [5, 8],
              [1.5, 1.8],
              [8, 8],
              [1, 0.6],
              [9, 11]])


kmeans = KMeans(n_clusters=2 )
kmeans.fit(X)

centroids = kmeans.cluster_centers_
labels = kmeans.labels_

print(centroids)
print(labels)

colors = ["g.","r.","c.","y."]

for i in range(len(X)):
    print("coordinate:",X[i], "label:", labels[i])
    plt.plot(X[i][0], X[i][1], colors[labels[i]], markersize = 10)


plt.scatter(centroids[:, 0],centroids[:, 1], marker = "x", s=150, linewidths = 5, zorder = 10)

plt.show()

I want to read a CSV file and use one of the DataFrame columns in place of the array used above.

I tried the following, but it didn't work:

df=pd.read_csv("Output.csv",encoding='latin1')
X=pd.DataFrame([['Column_1']]) 

I got the following error

ValueError: could not convert string to float: 'Column_1'

This is what my data looks like when I call df.head():

    x    id  ... Column_name v      Column_1
0  25  0001  ...         NaN             854
1  28  0002  ...         NaN            85,4
2  29  0003  ...         NaN            1524
3  32  NaN   ...         NaN               0
4  85  0004  ...         NaN               0

Upvotes: 2

Views: 1119

Answers (1)

M_S_N

Reputation: 2810

When you run the following command from your question:

X=pd.DataFrame([['Column_1']]) 

X now holds this:

          0
0  Column_1

The error is clear: KMeans works on numeric data, and the string 'Column_1' cannot be converted to a float.

You can simply select your first column like this:
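
To see why the error mentions the column name rather than the data, here is a minimal sketch. It only uses the expression from the question; the `conversion_error` variable is a name introduced for illustration.

```python
import pandas as pd

# The nested list is treated as the data itself: one row, one column,
# holding the literal string 'Column_1' (nothing is read from df).
X = pd.DataFrame([['Column_1']])
print(X.shape)
print(X.iloc[0, 0])

# Converting that string to a number fails, which is the same problem
# KMeans hits when it tries to turn X into a float array.
try:
    float(X.iloc[0, 0])
    conversion_error = None
except ValueError as exc:
    conversion_error = str(exc)
print(conversion_error)
```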

X=df[['your_first_col_name']]
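
As a rough sketch of how this fits into the clustering code from the question: the small DataFrame below is a hypothetical stand-in for the CSV (values loosely based on the df.head() output, already numeric), and `n_init` and `random_state` are set only to make the run repeatable.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical stand-in for the CSV; in practice df would come from pd.read_csv.
df = pd.DataFrame({"x": [25, 28, 29, 32, 85],
                   "Column_1": [854.0, 85.4, 1524.0, 0.0, 0.0]})

# Double brackets keep a 2-D DataFrame of shape (n_samples, 1),
# which is the layout KMeans expects; df["Column_1"] would be 1-D.
X = df[["Column_1"]]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans.fit(X)

labels = kmeans.labels_
centroids = kmeans.cluster_centers_
print(labels)
print(centroids)
```

Note that with a single feature the centroids have only one coordinate, so the 2-D scatter plot from the tutorial would need a second column (for example `df[["x", "Column_1"]]`) to work unchanged.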

Edit: to handle the commas, you can use:

df['Column_1']=df['Column_1'].str.replace(',','.')
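
A minimal sketch of that replacement on a hypothetical stand-in column. The replacement itself still leaves strings, so this example also converts explicitly with `pd.to_numeric`, which is one possible way to end up with a float column.

```python
import pandas as pd

# Hypothetical column that was read as text because of the decimal commas.
df = pd.DataFrame({"Column_1": ["854", "85,4", "1524", "0", "0"]})

# Swap the decimal comma for a point; the values are still strings here.
df["Column_1"] = df["Column_1"].str.replace(",", ".", regex=False)

# Convert to numbers; errors="coerce" turns anything unparseable into NaN.
df["Column_1"] = pd.to_numeric(df["Column_1"], errors="coerce")
print(df["Column_1"].tolist())
print(df["Column_1"].dtype)
```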

Another way to handle data that uses ',' instead of '.' as the decimal mark, as in the European format, is to pass the decimal argument when reading the CSV. If the original data looks like this:

A
1253
1253,5
12578,8
148,45
124589

we can read this data as

df=pd.read_csv('c2.csv', decimal=',')

and the column A will then be:

0      1253.00
1      1253.50
2     12578.80
3       148.45
4    124589.00
Name: A, dtype: float64
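
A self-contained sketch of the same idea, reading the sample data from a string instead of a file. The `sep=";"` argument is an assumption added here: with the default ',' separator, an unquoted decimal comma would be read as a field break, so files in this format usually use a different delimiter.

```python
import io
import pandas as pd

# Stand-in for the file contents shown above (single column A).
csv_text = "A\n1253\n1253,5\n12578,8\n148,45\n124589\n"

# decimal="," tells the parser to treat the comma as the decimal mark.
# sep=";" is an assumption so the comma is not also the field delimiter.
df = pd.read_csv(io.StringIO(csv_text), sep=";", decimal=",")
print(df["A"])

# Two-dimensional selection, ready to pass to KMeans.fit.
X = df[["A"]]
print(X.shape)
```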

Upvotes: 2
