Reputation: 55
I am following a tutorial to learn how to use KMeans.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import style
style.use("ggplot")
from sklearn.cluster import KMeans
X = np.array([[1, 2],
[5, 8],
[1.5, 1.8],
[8, 8],
[1, 0.6],
[9, 11]])
kmeans = KMeans(n_clusters=2 )
kmeans.fit(X)
centroids = kmeans.cluster_centers_
labels = kmeans.labels_
print(centroids)
print(labels)
colors = ["g.","r.","c.","y."]
for i in range(len(X)):
    print("coordinate:", X[i], "label:", labels[i])
    plt.plot(X[i][0], X[i][1], colors[labels[i]], markersize=10)
plt.scatter(centroids[:, 0],centroids[:, 1], marker = "x", s=150, linewidths = 5, zorder = 10)
plt.show()
I want to read a CSV file and use one of the DataFrame columns in place of the array above.
I tried the following, but it didn't work:
df=pd.read_csv("Output.csv",encoding='latin1')
X=pd.DataFrame([['Column_1']])
I got the following error:
ValueError: could not convert string to float: 'Column_1'
This is what the output looks like when I call df.head():
x id ... Column_name v Column_1
0 25 0001 ... NaN 854
1 28 0002 ... NaN 85,4
2 29 0003 ... NaN 1524
3 32 NaN ... NaN 0
4 85 0004 ... NaN 0
Upvotes: 2
Views: 1119
Reputation: 2810
When you run the following command from your question:
X=pd.DataFrame([['Column_1']])
X now holds this:
          0
0  Column_1
The error is clear: KMeans works on numeric data, and the string 'Column_1' cannot be converted to a float.
You can simply select your column from the DataFrame like this:
X=df[['your_first_col_name']]
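To make the difference concrete, here is a minimal sketch. It assumes a small in-memory DataFrame as a stand-in for the CSV (the values are made up), and the `n_init` and `random_state` arguments are only there to make the run repeatable:

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical stand-in for the CSV contents; the values are made up.
df = pd.DataFrame({"Column_1": [854.0, 85.4, 1524.0, 0.0, 0.0, 1490.0]})

# This builds a brand-new 1x1 frame holding the *string* 'Column_1',
# which is what triggers "could not convert string to float".
wrong = pd.DataFrame([["Column_1"]])

# Double brackets select the column from df and keep it two-dimensional,
# i.e. the (n_samples, n_features) shape that KMeans.fit expects.
X = df[["Column_1"]]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans.fit(X)
labels = kmeans.labels_
centroids = kmeans.cluster_centers_
print(labels)
print(centroids)
```

If the column contains missing values, they have to be dropped or filled first, since KMeans does not accept NaN.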
Edit: To handle decimal commas, you can use:
df['Column_1']=df['Column_1'].str.replace(',','.').astype(float)
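Note that `str.replace` returns strings, so the column still needs a numeric conversion before clustering. A minimal sketch, assuming the column was read as text (the sample values are taken from the question's `df.head` output) and using `pd.to_numeric` as one possible way to convert:

```python
import pandas as pd

# Hypothetical column as it might arrive from the CSV: all strings,
# some using a comma as the decimal mark.
df = pd.DataFrame({"Column_1": ["854", "85,4", "1524", "0", "0"]})

# Swap the decimal comma for a dot (plain text replacement, no regex),
# then convert the strings to numbers.
# errors="coerce" turns anything unparseable into NaN instead of raising.
cleaned = df["Column_1"].str.replace(",", ".", regex=False)
df["Column_1"] = pd.to_numeric(cleaned, errors="coerce")

print(df["Column_1"])
```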
Another way to handle data that uses ',' instead of '.' for decimals, as in the European format, is to pass the decimal argument when reading the CSV.
So, if the original data looks like this:
A
1253
1253,5
12578,8
148,45
124589
we can read it with:
df=pd.read_csv('c2.csv', decimal=',')
and column A will be parsed as floats:
0 1253.00
1 1253.50
2 12578.80
3 148.45
4 124589.00
Name: A, dtype: float64
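For a file with several columns, the field separator has to differ from the decimal mark. A minimal sketch, assuming a semicolon-delimited file (common in European exports) and an extra `id` column that is purely illustrative; the text is read from memory here rather than from `c2.csv`:

```python
import io
import pandas as pd

# Hypothetical semicolon-delimited content with decimal commas.
raw = """id;A
1;1253
2;1253,5
3;12578,8
4;148,45
5;124589
"""

# sep=';' splits the fields, decimal=',' tells the parser how to read numbers.
df = pd.read_csv(io.StringIO(raw), sep=";", decimal=",")

print(df["A"])
```

Column A then comes back as float64 and can be passed on as `df[['A']]`.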
Upvotes: 2