Pandas Dataframe used for Train_Test_Split

Question

I have the following dataset, which I want to analyze with a K-nearest-neighbors classifier using pandas and sklearn:

https://archive.ics.uci.edu/ml/datasets/Credit+Approval

First I load the dataset into a DataFrame, with a header (A1 to A16) for each column:

df = pd.read_csv('crx.csv').dropna().reset_index(drop=True)

My question is: how do I use the train_test_split function to

a) split the data into training and test sets, and

b) Also flag the columns A1 to A15 as features and A16 as label?

I would like to have something like the below code, which of course doesn't work as I would like:

X_train, X_test, y_train, y_test = train_test_split(df[0:16], df['A16'], random_state=0)

where X_train holds 75% of the rows from columns A1 to A15, X_test holds the remaining 25%, y_train holds the same 75% of rows but only column A16 (the target), and y_test holds the remaining 25%.

My intention is later to use KNeighborsClassifier.fit with the training data.

user707650 · Accepted Answer

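Use df.iloc[:, 0:15] instead (or df[['A1', 'A2', 'A3', ...]], but that is more cumbersome in your case). Plain df[0:16] slices rows, not columns, and the column slice has to stop at 15 so that the label A16 is not included among the features:

X_train, X_test, y_train, y_test = train_test_split(df.iloc[:, 0:15], df['A16'], random_state=0)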
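To make the difference between row slicing and column slicing concrete, here is a minimal sketch. It does not read crx.csv; the DataFrame is a hypothetical stand-in with 20 rows of made-up integers and columns named A1 to A16, so only the shapes matter.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for crx.csv: 20 rows, columns named A1..A16.
columns = [f"A{i}" for i in range(1, 17)]
df = pd.DataFrame(np.arange(20 * 16).reshape(20, 16), columns=columns)

# Plain slicing on a DataFrame selects ROWS 0..15 and keeps all 16 columns.
rows = df[0:16]

# iloc with a column slice keeps every row and selects columns A1..A15.
# The end of the slice is exclusive, so 0:15 stops just before A16.
features = df.iloc[:, 0:15]

print(rows.shape)             # (16, 16)
print(features.shape)         # (20, 15)
print(features.columns[-1])   # A15
```

Note that a column slice of 0:16 would also pull in A16, which would leak the label into the features.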

See the documentation for iloc.
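For completeness, the whole flow from split to KNeighborsClassifier.fit might look like the sketch below. It is an illustration under assumptions, not a recipe for the real file: the data is randomly generated and purely numeric, the labels "+" and "-" are invented from A1, and n_neighbors=5 is an arbitrary choice. The actual Credit Approval data also contains categorical columns, which would need to be encoded as numbers before a KNN classifier could use them.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical numeric stand-in for the real data: 100 rows, columns A1..A16.
rng = np.random.default_rng(0)
columns = [f"A{i}" for i in range(1, 17)]
df = pd.DataFrame(rng.normal(size=(100, 16)), columns=columns)

# Invented label column: "+" when A1 is positive, "-" otherwise.
df["A16"] = np.where(df["A1"] > 0, "+", "-")

# Features are every column except the label; the label is A16 alone.
X = df.drop(columns="A16")
y = df["A16"]

# With no test_size given, train_test_split holds out 25% of the rows.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
print(X_train.shape, X_test.shape)  # (75, 15) (25, 15)

# Fit on the training rows only, then score on the held-out rows.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))
```

Selecting the features with df.drop(columns="A16") is equivalent to the positional slice here, and it keeps working if the column order ever changes.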
