Reputation: 35
I have the following dataset, which I want to analyze with K-Nearest-Neighbors using pandas and sklearn:
https://archive.ics.uci.edu/ml/datasets/Credit+Approval
First I load the dataset into a dataframe, with headers (A1 to A16) for each column:
df = pd.read_csv('crx.csv').dropna().reset_index(drop = True)
My question is: how do I use the train_test_split function to
a) split the data into test and train sets, and
b) flag the columns A1 to A15 as features and A16 as the label?
I would like something like the code below, which of course doesn't work as intended:
X_train, X_test, y_train, y_test = train_test_split(df[0:16], df['A16'], random_state=0)
where X_train will have 75% of the data from columns A1 to A15, X_test the remaining 25%, y_train the same 75% of the rows but only column A16 (the target), and y_test the remaining 25%.
My intention is to later use KNeighborsClassifier.fit with the training data.
Upvotes: 2
Views: 3587
Reputation:
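df[0:16] slices rows, not columns. Use df.iloc[:, 0:15] instead (or df[['A1', 'A2', 'A3', ...]], but that is more cumbersome in your case):
X_train, X_test, y_train, y_test = train_test_split(df.iloc[:, 0:15], df['A16'], random_state=0)
The end of an iloc slice is exclusive, so 0:15 selects the first 15 columns (A1 to A15) and keeps the label A16 out of the features.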
See the documentation for iloc.
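As a rough end-to-end sketch of the split followed by KNeighborsClassifier.fit: the snippet below does not read crx.csv. It assumes a synthetic stand-in dataframe with 15 numeric feature columns and a two-class label, purely so it runs on its own. The real Credit Approval data also contains categorical columns, which would need encoding before KNN can use them; that step is not shown here.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: synthetic stand-in for crx.csv with numeric features A1..A15
# and a two-class label A16. Replace with your own loaded dataframe.
rng = np.random.default_rng(0)
columns = [f"A{i}" for i in range(1, 17)]
df = pd.DataFrame(rng.normal(size=(100, 15)), columns=columns[:15])
df["A16"] = rng.choice(["+", "-"], size=100)

# Features: first 15 columns by position (the slice end is exclusive).
X = df.iloc[:, 0:15]
# Label: the A16 column.
y = df["A16"]

# With no test_size given, train_test_split holds out 25% of the rows.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit KNN on the training rows and score it on the held-out rows.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
accuracy = knn.score(X_test, y_test)
print(X_train.shape, X_test.shape, accuracy)
```

Selecting by label with df.loc[:, 'A1':'A15'] should give the same feature columns, since label-based slices in loc include both endpoints.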
Upvotes: 1