Reputation: 327
I have a .csv
file that is created from an nd.array
after the input data is processed by sklearn.cluster.DBSCAN()
.I would like to be able to "tie" every point in the cluster to an unique identifier given by a column in my input file.
This is how I'm reading my input_data
:
# Generate sample data
col_1 ="RL15_LONGITUDE"
col_2 ="RL15_LATITUDE"
data = pd.read_csv("2004_Charley_data.csv")
coords = data.as_matrix(columns=[col_1, col_2])
data = data[[col_1,col_2]].dropna()
data = data.as_matrix().astype('float16',copy=False)
And this is what it looks like:
RecordID Storm RL15_LATITUDE RL15_LONGITUDE
2004_Charley95104-257448 2004_Charley 25.81774 -80.25079
2004_Charley93724-254950 2004_Charley 26.116338 -81.74986
2004_Charley93724-254949 2004_Charley 26.116338 -81.74986
2004_Charley75496-215198 2004_Charley 26.11817 -81.75756
With some help I was able to take the output of DBSCAN
and save it as a .CSV
file like this:
clusters = (pd.concat([pd.DataFrame(c, columns=[col_2,col_1]).assign(cluster=i)
for i,c in enumerate(clusters)])
.reset_index()
.rename(columns={'index':'point'})
.set_index(['cluster','point'])
)
clusters.to_csv('output.csv')
My output now is multi-index, but I would like to know if there's a way I could change the column point to RecordID
instead of just a number? :
cluster point RL15_LATITUDE RL15_LONGITUDE
0 0 -81.0625 29.234375
0 1 -81.0625 29.171875
0 2 -81.0625 29.359375
1 0 -81.0625 29.25
1 1 -81.0625 29.21875
1 2 -81.0625 29.25
1 3 -81.0625 29.21875
Upvotes: 3
Views: 1535
Reputation: 210832
Code:
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn import metrics
from sklearn.datasets.samples_generator import make_blobs
from sklearn.preprocessing import StandardScaler
fn = r'D:\temp\.data\2004_Charley_data.csv'
df = pd.read_csv(fn)
cols = ['RL15_LONGITUDE','RL15_LATITUDE']
eps_=4
min_samples_=13
db = DBSCAN(eps=eps_/6371., min_samples=min_samples_, algorithm='ball_tree', metric='haversine').fit(np.radians(df[cols]))
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
df['cluster'] = labels
res = df[df.cluster >= 0]
print('--------------------------------')
print(res)
print('--------------------------------')
print(res.cluster.value_counts())
Output:
--------------------------------
RecordID Storm RL15_LATITUDE RL15_LONGITUDE cluster
5 2004_Charley73944-211787 2004_Charley 29.228560 -81.034440 0
13 2004_Charley72308-208134 2004_Charley 29.442692 -81.109528 0
18 2004_Charley68044-198941 2004_Charley 29.442692 -81.109528 0
19 2004_Charley67753-198272 2004_Charley 29.270940 -81.097300 0
22 2004_Charley64829-191531 2004_Charley 29.313223 -81.101620 0
.. ... ... ... ... ...
812 2004_Charley94314-256039 2004_Charley 28.287827 -81.353285 1
813 2004_Charley93913-255344 2004_Charley 26.532980 -82.194400 7
814 2004_Charley93913-255346 2004_Charley 27.210467 -81.863720 5
815 2004_Charley93913-255357 2004_Charley 26.935550 -82.054447 4
816 2004_Charley93913-255354 2004_Charley 26.935550 -82.054447 4
[688 rows x 5 columns]
--------------------------------
1 217
0 170
2 145
4 94
7 18
6 16
5 14
3 14
Name: cluster, dtype: int64
Old answer:
If i understood your code correctly you can do it this way:
# read CSV (you have provided space-delimited file and with one unnamed column, so i have converted it to somewhat similar to that from your question)
fn = r'D:\temp\.data\2004_Charley_data.csv'
df = pd.read_csv(fn, sep='\s+', index_col=0)
df.index = df.index.values + df.RecordID.map(str)
del df['RecordID']
first 10 rows:
In [148]: df.head(10)
Out[148]:
Storm RL15_LATITUDE RL15_LONGITUDE
RecordID
2004_Charley67146-196725 2004_Charley 33.807550 -78.701172
2004_Charley73944-211790 2004_Charley 33.618435 -78.993407
2004_Charley73944-211793 2004_Charley 28.609200 -80.818880
2004_Charley73944-211789 2004_Charley 29.383210 -81.160100
2004_Charley73944-211786 2004_Charley 33.691235 -78.895129
2004_Charley73944-211787 2004_Charley 29.228560 -81.034440
2004_Charley73944-211795 2004_Charley 28.357253 -80.701632
2004_Charley73944-211792 2004_Charley 34.204490 -77.924700
2004_Charley66636-195501 2004_Charley 33.436717 -79.132074
2004_Charley66631-195496 2004_Charley 33.646292 -78.977968
clustering:
cols = ['RL15_LONGITUDE','RL15_LATITUDE']
eps_=4
min_samples_=13
db = DBSCAN(eps=eps_/6371., min_samples=min_samples_, algorithm='ball_tree', metric='haversine').fit(np.radians(df[cols]))
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
setting cluster info to our DF - we can simply assign it as labels
has the same length as our DF:
df['cluster'] = labels
filter: keep only those rows where cluster >= 0
:
res = df[df.cluster >= 0]
result:
In [152]: res.head(10)
Out[152]:
Storm RL15_LATITUDE RL15_LONGITUDE cluster
RecordID
2004_Charley73944-211787 2004_Charley 29.228560 -81.034440 0
2004_Charley72308-208134 2004_Charley 29.442692 -81.109528 0
2004_Charley68044-198941 2004_Charley 29.442692 -81.109528 0
2004_Charley67753-198272 2004_Charley 29.270940 -81.097300 0
2004_Charley64829-191531 2004_Charley 29.313223 -81.101620 0
2004_Charley67376-197429 2004_Charley 29.196990 -80.993800 0
2004_Charley73720-211013 2004_Charley 29.171450 -81.037170 0
2004_Charley73705-210991 2004_Charley 28.308746 -81.424273 1
2004_Charley65157-192371 2004_Charley 28.308746 -81.424273 1
2004_Charley65126-192326 2004_Charley 28.308746 -81.424273 1
stats:
In [151]: res.cluster.value_counts()
Out[151]:
1 217
0 170
2 145
4 94
7 18
6 16
5 14
3 14
Name: cluster, dtype: int64
if you don't want to have RecordID
as index:
In [153]: res = res.reset_index()
In [154]: res.head(10)
Out[154]:
RecordID Storm RL15_LATITUDE RL15_LONGITUDE cluster
0 2004_Charley73944-211787 2004_Charley 29.228560 -81.034440 0
1 2004_Charley72308-208134 2004_Charley 29.442692 -81.109528 0
2 2004_Charley68044-198941 2004_Charley 29.442692 -81.109528 0
3 2004_Charley67753-198272 2004_Charley 29.270940 -81.097300 0
4 2004_Charley64829-191531 2004_Charley 29.313223 -81.101620 0
5 2004_Charley67376-197429 2004_Charley 29.196990 -80.993800 0
6 2004_Charley73720-211013 2004_Charley 29.171450 -81.037170 0
7 2004_Charley73705-210991 2004_Charley 28.308746 -81.424273 1
8 2004_Charley65157-192371 2004_Charley 28.308746 -81.424273 1
9 2004_Charley65126-192326 2004_Charley 28.308746 -81.424273 1
Upvotes: 1