rubito

Reputation: 327

How to assign unique identifier to DataFrame row

I have a .csv file that is created from an nd.array after the input data is processed by sklearn.cluster.DBSCAN(). I would like to be able to "tie" every point in a cluster to a unique identifier given by a column in my input file.

This is how I'm reading my input_data:

# Generate sample data
import pandas as pd

col_1 = "RL15_LONGITUDE"
col_2 = "RL15_LATITUDE"
data = pd.read_csv("2004_Charley_data.csv")
data = data[[col_1, col_2]].dropna()       # drop rows with missing coordinates first
coords = data.values                       # .values replaces the deprecated .as_matrix()
data = data.values.astype('float16', copy=False)
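As an aside, casting coordinates to float16 costs precision: half-precision floats carry only about three significant decimal digits, which is why the clustered output further down shows coarse values like -81.0625. A minimal standalone check (not part of the original code):

```python
import numpy as np

lon = -81.74986
# float16 spacing near 81 is 0.0625, so the value collapses to -81.75
print(np.float16(lon))   # -81.75
print(np.float32(lon))   # single precision keeps ~7 significant digits
```

float32 (or plain float64) is usually a safer choice for lat/long data.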

And this is what it looks like:

RecordID                  Storm         RL15_LATITUDE   RL15_LONGITUDE
2004_Charley95104-257448  2004_Charley  25.81774        -80.25079
2004_Charley93724-254950  2004_Charley  26.116338       -81.74986
2004_Charley93724-254949  2004_Charley  26.116338       -81.74986
2004_Charley75496-215198  2004_Charley  26.11817        -81.75756

With some help I was able to take the output of DBSCAN and save it as a .CSV file like this:

clusters = (pd.concat([pd.DataFrame(c, columns=[col_2, col_1]).assign(cluster=i)
                       for i, c in enumerate(clusters)])
              .reset_index()
              .rename(columns={'index': 'point'})
              .set_index(['cluster', 'point']))
clusters.to_csv('output.csv')
clusters.to_csv('output.csv')

My output is now multi-indexed, but is there a way to make the point column hold the RecordID instead of just a number?

cluster point   RL15_LATITUDE   RL15_LONGITUDE
0       0   -81.0625    29.234375
0       1   -81.0625    29.171875
0       2   -81.0625    29.359375
1       0   -81.0625    29.25
1       1   -81.0625    29.21875
1       2   -81.0625    29.25
1       3   -81.0625    29.21875

Upvotes: 3

Views: 1535

Answers (1)

MaxU - stand with Ukraine

Reputation: 210832

UPDATE:

Code:

import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN

fn = r'D:\temp\.data\2004_Charley_data.csv'
df = pd.read_csv(fn)

cols = ['RL15_LONGITUDE','RL15_LATITUDE']
eps_=4
min_samples_=13

db = DBSCAN(eps=eps_/6371., min_samples=min_samples_,
            algorithm='ball_tree', metric='haversine'
           ).fit(np.radians(df[cols]))

core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_

df['cluster'] = labels

res = df[df.cluster >= 0]

print('--------------------------------')
print(res)
print('--------------------------------')
print(res.cluster.value_counts())

Output:

--------------------------------
                     RecordID         Storm  RL15_LATITUDE  RL15_LONGITUDE  cluster
5    2004_Charley73944-211787  2004_Charley      29.228560      -81.034440        0
13   2004_Charley72308-208134  2004_Charley      29.442692      -81.109528        0
18   2004_Charley68044-198941  2004_Charley      29.442692      -81.109528        0
19   2004_Charley67753-198272  2004_Charley      29.270940      -81.097300        0
22   2004_Charley64829-191531  2004_Charley      29.313223      -81.101620        0
..                        ...           ...            ...             ...      ...
812  2004_Charley94314-256039  2004_Charley      28.287827      -81.353285        1
813  2004_Charley93913-255344  2004_Charley      26.532980      -82.194400        7
814  2004_Charley93913-255346  2004_Charley      27.210467      -81.863720        5
815  2004_Charley93913-255357  2004_Charley      26.935550      -82.054447        4
816  2004_Charley93913-255354  2004_Charley      26.935550      -82.054447        4

[688 rows x 5 columns]
--------------------------------
1    217
0    170
2    145
4     94
7     18
6     16
5     14
3     14
Name: cluster, dtype: int64
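A note on eps=eps_/6371. in the code above: the haversine metric works on angles in radians, so a neighborhood radius given in kilometres has to be divided by the Earth's mean radius (~6371 km) to convert it. A small sketch of that conversion (the constant and helper name are mine, not from the answer):

```python
EARTH_RADIUS_KM = 6371.0  # mean Earth radius, matching the 6371 used above

def km_to_radians(km):
    """Convert a great-circle distance in km to an angle in radians."""
    return km / EARTH_RADIUS_KM

eps_rad = km_to_radians(4)  # eps_=4 above means a 4 km neighborhood
print(eps_rad)
```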

Old answer:

If I understood your code correctly, you can do it this way:

# read CSV (you provided a space-delimited file with one unnamed column,
# so I converted it to something close to the sample in your question)
fn = r'D:\temp\.data\2004_Charley_data.csv'

df = pd.read_csv(fn, sep='\s+', index_col=0)
df.index = df.index.values + df.RecordID.map(str)
del df['RecordID']
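To illustrate what those two index lines do: the unnamed first column (read in as the index) holds the storm prefix, and RecordID holds the numeric suffix, so concatenating them rebuilds the full identifier. A toy example with made-up values:

```python
import pandas as pd

# hypothetical split data: prefix in the (unnamed) index, suffix in RecordID
df = pd.DataFrame({'RecordID': ['95104-257448', '93724-254950'],
                   'Storm': ['2004_Charley', '2004_Charley']},
                  index=['2004_Charley', '2004_Charley'])

df.index = df.index.values + df.RecordID.map(str)  # glue the two halves back together
del df['RecordID']
print(df.index.tolist())  # ['2004_Charley95104-257448', '2004_Charley93724-254950']
```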

first 10 rows:

In [148]: df.head(10)
Out[148]:
                                 Storm  RL15_LATITUDE  RL15_LONGITUDE
RecordID
2004_Charley67146-196725  2004_Charley      33.807550      -78.701172
2004_Charley73944-211790  2004_Charley      33.618435      -78.993407
2004_Charley73944-211793  2004_Charley      28.609200      -80.818880
2004_Charley73944-211789  2004_Charley      29.383210      -81.160100
2004_Charley73944-211786  2004_Charley      33.691235      -78.895129
2004_Charley73944-211787  2004_Charley      29.228560      -81.034440
2004_Charley73944-211795  2004_Charley      28.357253      -80.701632
2004_Charley73944-211792  2004_Charley      34.204490      -77.924700
2004_Charley66636-195501  2004_Charley      33.436717      -79.132074
2004_Charley66631-195496  2004_Charley      33.646292      -78.977968

clustering:

cols = ['RL15_LONGITUDE','RL15_LATITUDE']

eps_=4
min_samples_=13

db = DBSCAN(eps=eps_/6371., min_samples=min_samples_,
            algorithm='ball_tree', metric='haversine'
           ).fit(np.radians(df[cols]))

core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_

setting the cluster info on our DF - we can simply assign it, as labels has the same length as our DF:

df['cluster'] = labels

filter: keep only the rows where cluster >= 0 (DBSCAN labels noise points as -1):

res = df[df.cluster >= 0]

result:

In [152]: res.head(10)
Out[152]:
                                 Storm  RL15_LATITUDE  RL15_LONGITUDE  cluster
RecordID
2004_Charley73944-211787  2004_Charley      29.228560      -81.034440        0
2004_Charley72308-208134  2004_Charley      29.442692      -81.109528        0
2004_Charley68044-198941  2004_Charley      29.442692      -81.109528        0
2004_Charley67753-198272  2004_Charley      29.270940      -81.097300        0
2004_Charley64829-191531  2004_Charley      29.313223      -81.101620        0
2004_Charley67376-197429  2004_Charley      29.196990      -80.993800        0
2004_Charley73720-211013  2004_Charley      29.171450      -81.037170        0
2004_Charley73705-210991  2004_Charley      28.308746      -81.424273        1
2004_Charley65157-192371  2004_Charley      28.308746      -81.424273        1
2004_Charley65126-192326  2004_Charley      28.308746      -81.424273        1

stats:

In [151]: res.cluster.value_counts()
Out[151]:
1    217
0    170
2    145
4     94
7     18
6     16
5     14
3     14
Name: cluster, dtype: int64
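As background on the cluster >= 0 filter used above: DBSCAN assigns the label -1 to noise points, so the comparison drops exactly the unclustered rows. A tiny illustration with fabricated labels, just to show the mechanics:

```python
import pandas as pd

# pretend labels as DBSCAN would return them: -1 means noise
df = pd.DataFrame({'point': range(5), 'cluster': [-1, 0, 0, 1, -1]})
res = df[df.cluster >= 0]
print(len(df), '->', len(res))  # 5 -> 3: the two noise rows are gone
```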

If you don't want RecordID as the index:

In [153]: res = res.reset_index()

In [154]: res.head(10)
Out[154]:
                   RecordID         Storm  RL15_LATITUDE  RL15_LONGITUDE  cluster
0  2004_Charley73944-211787  2004_Charley      29.228560      -81.034440        0
1  2004_Charley72308-208134  2004_Charley      29.442692      -81.109528        0
2  2004_Charley68044-198941  2004_Charley      29.442692      -81.109528        0
3  2004_Charley67753-198272  2004_Charley      29.270940      -81.097300        0
4  2004_Charley64829-191531  2004_Charley      29.313223      -81.101620        0
5  2004_Charley67376-197429  2004_Charley      29.196990      -80.993800        0
6  2004_Charley73720-211013  2004_Charley      29.171450      -81.037170        0
7  2004_Charley73705-210991  2004_Charley      28.308746      -81.424273        1
8  2004_Charley65157-192371  2004_Charley      28.308746      -81.424273        1
9  2004_Charley65126-192326  2004_Charley      28.308746      -81.424273        1

Upvotes: 1
