KingPolygon

Reputation: 4755

Preparing CSV file data for Scikit-Learn Using Pandas?

I have a CSV file without headers which I'm importing into Python using pandas. The last column is the target class, while the rest of the columns are pixel values for images. How can I split this dataset into a training set and a testing set (80/20) using pandas?

Also, once that is done, how would I split each of those sets so that I can define x (all columns except the last one) and y (the last column)?

I've imported my file using:

dataset = pd.read_csv('example.csv', header=None, sep=',')

Thanks

Upvotes: 6

Views: 9630

Answers (3)

user2285236

I'd recommend using sklearn's train_test_split:

from sklearn.model_selection import train_test_split
# For scikit-learn versions before 0.18, import from sklearn.cross_validation instead:
# from sklearn.cross_validation import train_test_split
X, y = dataset.iloc[:, :-1], dataset.iloc[:, -1]
kwargs = dict(test_size=0.2, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, **kwargs)
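To see what this split returns, here is a minimal sketch that runs the same call on a small synthetic DataFrame. The synthetic data (100 rows, 4 pixel columns plus a label column) is an assumption standing in for example.csv, not the asker's actual file.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the headerless CSV: 100 rows, 4 "pixel" columns
# followed by a class label in the last column.
rng = np.random.default_rng(0)
dataset = pd.DataFrame(rng.integers(0, 256, size=(100, 4)))
dataset[4] = rng.integers(0, 3, size=100)

# Features are every column but the last; the target is the last column.
X, y = dataset.iloc[:, :-1], dataset.iloc[:, -1]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

print(X_train.shape, X_test.shape)  # (80, 4) (20, 4)
print(y_train.shape, y_test.shape)  # (80,) (20,)
```

The rows are shuffled before splitting, and X_train and y_train keep the same index labels, so features and targets stay aligned.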

Upvotes: 10

Kartik

Reputation: 8703

You can simply do:

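import numpy as np
# Randomly pick 80% of the row labels, without replacement
train_idx = np.random.choice(dataset.index, int(0.8 * len(dataset)), replace=False)
choices = np.isin(dataset.index, train_idx)  # boolean mask: True for training rows
training = dataset[choices]
testing = dataset[~choices]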

Then, to pass the features and target to scikit-learn (scikit_func stands in for whichever function or estimator method you call):

scikit_func(x=training.iloc[:,0:-1], y=training.iloc[:,-1])

Let me know if this doesn't work.
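If you would rather stay within pandas, a similar split can probably be done with DataFrame.sample and DataFrame.drop. This is only a sketch: the synthetic DataFrame is a hypothetical stand-in for the CSV, and it assumes the index labels are unique (true for the default index that read_csv produces).

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the headerless CSV (last column is the class).
rng = np.random.default_rng(0)
dataset = pd.DataFrame(rng.integers(0, 256, size=(50, 5)))

# Draw 80% of the rows without replacement; the remaining rows form the test set.
training = dataset.sample(frac=0.8, random_state=1)
testing = dataset.drop(training.index)

# Features are every column but the last; the target is the last column.
X_train, y_train = training.iloc[:, :-1], training.iloc[:, -1]
X_test, y_test = testing.iloc[:, :-1], testing.iloc[:, -1]

print(len(training), len(testing))  # 40 10
```

Passing random_state makes the split reproducible between runs.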

Upvotes: 0

Randhawa

Reputation: 644

You can try this.

Separating the target class from the rest:

pixel_values = dataset[dataset.columns[:-1]]
target_class = dataset[dataset.columns[-1:]]

Now to create test and training samples:

I would just use numpy's rand:

mask = np.random.rand(len(pixel_values)) < 0.8
train = pixel_values[mask]
test = pixel_values[~mask]

Now you have training and test samples in train and test, in a roughly 80:20 ratio.
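Note that the target column has to be split with the same mask, otherwise features and labels no longer line up. A minimal sketch, assuming a synthetic DataFrame in place of the real CSV (all names and data here are illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the headerless CSV (last column is the class).
rng = np.random.default_rng(0)
dataset = pd.DataFrame(rng.integers(0, 256, size=(1000, 5)))

pixel_values = dataset.iloc[:, :-1]
target_class = dataset.iloc[:, -1]

# Each row lands in the training set with probability 0.8, so the split
# is only approximately 80:20.
mask = rng.random(len(dataset)) < 0.8

# Apply the same mask to features and target so the rows stay aligned.
X_train, X_test = pixel_values[mask], pixel_values[~mask]
y_train, y_test = target_class[mask], target_class[~mask]

print(mask.mean())  # close to 0.8, but not exactly
```

If an exact 80/20 split matters, sampling a fixed number of rows is the safer choice than a random mask.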

Upvotes: 1
