Reputation: 245
import numpy as np
import pandas as pd
from sklearn import model_selection
#I have imported the dataset with pandas
df = pd.read_csv(filename)
####Preparing data for sklearn
#1)Dropped the names of each sample
df.drop(columns=['id'], inplace=True)
#2)Separate the features (X, every column except 'class') from the class labels (y)
X = np.array(df.drop(columns=['class']))
y = np.array(df['class'])
######
#Split data into testing/training datasets
X_train, X_test, y_train, y_test = model_selection.train_test_split(X,y,test_size=0.4)
QUESTION: After the split, how do I retrieve the names (ids) of the samples that ended up in the training and test sets?
Upvotes: 0
Views: 1863
Reputation: 21274
If you make id the index of df, you'll retain the index values after running train_test_split.
First, let's generate some example data:
import numpy as np
import pandas as pd
N = 10
ids = ['a','b','c','d','e','f','g','h','i','j']
values = np.random.random(N)
classes = np.random.binomial(n=1,p=.5,size=N)
df = pd.DataFrame({'id':ids,'predictor':values,'label':classes})
Then explicitly set id as the index:
df.set_index('id', inplace=True)
Now df looks like this:
label predictor
id
a 1 0.214636
b 0 0.466477
c 1 0.300480
d 1 0.378645
e 0 0.755834
f 1 0.506719
g 0 0.948360
h 0 0.736498
i 1 0.058591
j 1 0.997003
Splitting into train/test sets using Pandas objects will retain their original index values:
X = df.predictor
y = df.label
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)
print(X_train)
id
a 0.214636
b 0.466477
d 0.378645
j 0.997003
i 0.058591
f 0.506719
Name: predictor, dtype: float64
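To get the sample names themselves, you can read them off the index of each split. The sketch below rebuilds a small frame like the one above (the data, the random_state value, and the variable names train_ids, test_ids, ids_tr and ids_te are only illustrative assumptions). It also shows an alternative if you prefer to keep working with NumPy arrays as in the question: train_test_split accepts any number of arrays, so the ids can be passed along and split in the same way.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Illustrative data mirroring the example frame above (values are random)
df = pd.DataFrame({
    'id': list('abcdefghij'),
    'predictor': np.random.random(10),
    'label': np.random.binomial(n=1, p=0.5, size=10),
}).set_index('id')

# Option 1: split pandas objects and read the ids from the index
X_train, X_test, y_train, y_test = train_test_split(
    df.predictor, df.label, test_size=0.4, random_state=0
)
train_ids = X_train.index.tolist()
test_ids = X_test.index.tolist()
print(train_ids)
print(test_ids)

# Option 2: keep NumPy arrays and pass the ids as an extra array,
# so they are shuffled and split together with X and y
X_arr = df[['predictor']].to_numpy()
y_arr = df['label'].to_numpy()
ids = df.index.to_numpy()
X_tr, X_te, y_tr, y_te, ids_tr, ids_te = train_test_split(
    X_arr, y_arr, ids, test_size=0.4, random_state=0
)
print(ids_tr)
print(ids_te)
```

Option 1 keeps everything labelled, which is convenient for inspecting predictions per sample later. Option 2 is closer to the original code if the rest of the pipeline expects plain arrays.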
Upvotes: 1