Vince

Reputation: 245

In Python sklearn, how do I retrieve the names of samples/variables in test/training data?

import numpy as np
import pandas as pd
from sklearn import model_selection

# I have imported the dataset with pandas
df = pd.read_csv(filename)

#### Preparing data for sklearn
# 1) Drop the name (id) of each sample
df.drop(['id'], axis=1, inplace=True)
# 2) Separate the features (X, every column except the classification) from the classification column (y)
X = np.array(df.drop(['class'], axis=1))
y = np.array(df['class'])

# Split data into testing/training datasets
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.4)

QUESTION: How do I retrieve the names of the samples that ended up in the test/training data (after testing)?

Upvotes: 0

Views: 1863

Answers (1)

andrew_reece

Reputation: 21274

If you make id the index of df, you'll retain the index values after running train_test_split. First, let's generate some example data:

import numpy as np
import pandas as pd

N = 10
ids = ['a','b','c','d','e','f','g','h','i','j']
values = np.random.random(N)
classes = np.random.binomial(n=1,p=.5,size=N)
df = pd.DataFrame({'id':ids,'predictor':values,'label':classes})

Then explicitly set id as the index:

df.set_index('id', inplace=True)

Now df looks like this:

    label  predictor
id                  
a       1   0.214636
b       0   0.466477
c       1   0.300480
d       1   0.378645
e       0   0.755834
f       1   0.506719
g       0   0.948360
h       0   0.736498
i       1   0.058591
j       1   0.997003
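
When the data comes from a CSV, as in the question, the same index can probably be set at load time instead of with a separate set_index call. A minimal sketch, assuming the file has an id column and a class column; the CSV contents below are made up to stand in for the question's filename:

```python
import io

import pandas as pd

# Hypothetical CSV contents standing in for the question's `filename`
csv_text = "id,predictor,class\na,0.21,1\nb,0.47,0\nc,0.30,1\n"

# index_col="id" makes the id column the index while loading,
# so the ids are kept as row labels instead of being dropped
df = pd.read_csv(io.StringIO(csv_text), index_col="id")

X = df.drop(columns=["class"])  # features only
y = df["class"]                 # classification column

print(X.index.tolist())  # ['a', 'b', 'c']
```

Because id is now the index rather than a column, it does not end up among the features passed to the model.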

Splitting into train/test sets using Pandas objects will retain their original index values:

X = df.predictor
y = df.label

# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)

print(X_train)
id
a    0.214636
b    0.466477
d    0.378645
j    0.997003
i    0.058591
f    0.506719
Name: predictor, dtype: float64
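
To get just the names, read the index of each split. The sketch below assumes an example frame like the one above; the fixed seeds and the names train_ids, test_ids, ids_tr and ids_te are my own additions for illustration. It also shows a variant for NumPy arrays, as used in the question: train_test_split accepts any number of equal-length arrays, so the ids can be passed along and split in the same way.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical example data, similar to the frame built above
ids = list("abcdefghij")
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "id": ids,
    "predictor": rng.random(len(ids)),
    "label": rng.integers(0, 2, size=len(ids)),
}).set_index("id")

# Option 1: split pandas objects; the index travels with each row
X_train, X_test, y_train, y_test = train_test_split(
    df[["predictor"]], df["label"], test_size=0.4, random_state=0
)
train_ids = X_train.index.tolist()
test_ids = X_test.index.tolist()
print("train:", train_ids)
print("test: ", test_ids)

# Option 2: keep NumPy arrays and pass the ids as a third array;
# each input array comes back as a (train, test) pair
X_arr = df[["predictor"]].to_numpy()
y_arr = df["label"].to_numpy()
id_arr = df.index.to_numpy()
X_tr, X_te, y_tr, y_te, ids_tr, ids_te = train_test_split(
    X_arr, y_arr, id_arr, test_size=0.4, random_state=0
)
print("train:", list(ids_tr))
print("test: ", list(ids_te))
```

Option 1 needs no extra bookkeeping if the model accepts pandas input; Option 2 fits code that already converts everything to arrays.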

Upvotes: 1
