Kda

Reputation: 57

Preprocessing raw data in Machine Learning using Python

I have a raw dataset with 9 numerical features; the 10th column is categorical (country = france, germany, india, china, mexico). The dataset has 20,000 rows. Many of the numerical feature columns have missing data and are not on the same scale. I am supposed to predict a feature whose values lie in the 5th column of the dataset.

How should I approach it?

Should I :

  1. Preprocess the entire raw dataset with an Imputer (for missing data), an Encoder for the categorical column, and feature scaling.

  2. Split the preprocessed data into training and test sets.

Or should it be the other way around:

  1. Split the raw data into training and test sets.
  2. Preprocess the training set only.

Reason I am confused: once I preprocess the raw dataset, the categorical column explodes into 5 new columns. So how do I snip the independent variables and the dependent variable (5th column) out of this dataset to produce the x and y arrays respectively, which I can then split into x_train, x_test, y_train, y_test with:

 from sklearn.model_selection import train_test_split
 x_train, x_test, y_train, y_test = train_test_split(
     x, y, test_size=1/3, random_state=0)

Upvotes: 0

Views: 833

Answers (2)

JKC

Reputation: 2618

If you are using Pandas, you can pre-process the data by referring to columns by name instead of by position. That way you do not need to worry about where your target variable lies in the dataset.

If you are not using Pandas, it is better to take the target variable out first and then pre-process the dataset.

With either of the above methods, you can do the pre-processing on the overall dataset or after splitting it in two.
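For instance, with pandas you can pull the target out by name before encoding, so the one-hot explosion of the country column never touches it (a minimal sketch; the column names and values here are made up, not from the question's actual dataset):

```python
import pandas as pd

# Hypothetical toy frame standing in for the real dataset:
# one numerical feature, the categorical country column, and
# a target column (the "5th column" from the question).
df = pd.DataFrame({
    "f1": [1.0, 2.0, None, 4.0],
    "country": ["france", "germany", "india", "china"],
    "target": [10.0, 20.0, 30.0, 40.0],
})

# Select the target by name, so its position no longer matters.
y = df["target"]

# One-hot encode only the categorical column; the target is
# already out of the frame, so it cannot be encoded by mistake.
x = pd.get_dummies(df.drop(columns=["target"]), columns=["country"])
```

After this, `x` holds `f1` plus one dummy column per country, and `x`/`y` can go straight into `train_test_split`.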

Upvotes: 0

Jblasco

Reputation: 3967

As far as I know, you should separate your data first, then deduce the transformation you need from the training set and apply it to both the training and validation/testing sets.

The reason is that if you use all the data, you get more information than the training set alone provides (say, you measure the mean or standard deviation used to scale a column more accurately). This means you train your model on the training set plus a hint of the validation/testing set, which biases the predictions you later draw from it. This is called data snooping or data dredging (https://en.wikipedia.org/wiki/Data_dredging).
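A minimal sketch of that split-first workflow with scikit-learn, using SimpleImputer and StandardScaler as stand-ins for whatever preprocessing the real dataset needs (the toy arrays are made up):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Toy numerical data with missing values.
x = np.array([[1.0, 2.0], [np.nan, 4.0], [5.0, 6.0], [7.0, np.nan]])
y = np.array([0, 1, 0, 1])

# 1. Split the raw data first.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.5, random_state=0)

# 2. Fit the imputer and scaler on the training set only...
imputer = SimpleImputer(strategy="mean")
scaler = StandardScaler()
x_train = scaler.fit_transform(imputer.fit_transform(x_train))

# 3. ...then apply the already-fitted transforms to the test set,
#    so no test-set statistics leak into training.
x_test = scaler.transform(imputer.transform(x_test))
```

Because `transform` (not `fit_transform`) is used on the test set, the means and standard deviations come entirely from the training data, which is exactly what avoids the snooping described above.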

Upvotes: 2
