Reputation: 4755
I have a CSV file without headers which I'm importing into Python using pandas. The last column is the target class, while the remaining columns are pixel values for images. How can I split this dataset into a training set and a test set (80/20) using pandas?
Also, once that is done, how do I split each of those sets into x (all columns except the last one) and y (the last column)?
I've imported my file using:
dataset = pd.read_csv('example.csv', header=None, sep=',')
Thanks
Upvotes: 6
Views: 9630
Reputation:
I'd recommend using sklearn's train_test_split
from sklearn.model_selection import train_test_split
# for versions before 0.18, import from sklearn.cross_validation (removed in 0.20)
# from sklearn.cross_validation import train_test_split
X, y = dataset.iloc[:, :-1], dataset.iloc[:, -1]
kwargs = dict(test_size=0.2, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, **kwargs)
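Since the last column is a class label, it may be worth keeping the class proportions equal in both subsets. Here is a minimal sketch of that, assuming a small synthetic DataFrame as a stand-in for the real CSV (the data, the column count and the two balanced classes are made up for illustration). It uses the stratify parameter of train_test_split:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the headerless CSV: 4 "pixel" columns plus a label column.
rng = np.random.default_rng(0)
pixels = rng.integers(0, 256, size=(100, 4))
labels = np.arange(100) % 2  # two balanced classes, 50 rows each
dataset = pd.DataFrame(np.column_stack([pixels, labels]))

# Features are every column except the last; the target is the last column.
X, y = dataset.iloc[:, :-1], dataset.iloc[:, -1]

# stratify=y asks for the same class proportions in the train and test subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y
)

print(X_train.shape, X_test.shape)  # (80, 4) (20, 4)
```

The row indices of X_train and y_train stay aligned, so the two can be passed to an estimator together.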
Upvotes: 10
Reputation: 8703
You can simply do:
import numpy as np
# np.isin replaces np.in1d, which is deprecated in NumPy 2.0
choices = np.isin(dataset.index, np.random.choice(dataset.index, int(0.8 * len(dataset)), replace=False))
training = dataset[choices]
testing = dataset[np.invert(choices)]
Then, to pass it as x and y to scikit-learn (scikit_func is a placeholder for whichever function you call):
scikit_func(x=training.iloc[:,0:-1], y=training.iloc[:,-1])
Let me know if this doesn't work.
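If you'd rather stay within pandas, a similar split can probably be done with DataFrame.sample and DataFrame.drop. This is only a sketch: the DataFrame below is synthetic stand-in data, and it assumes the index is unique (which is the case for the default index that read_csv produces).

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the headerless CSV: 4 "pixel" columns plus a label column.
rng = np.random.default_rng(0)
dataset = pd.DataFrame(rng.integers(0, 256, size=(50, 5)))

# Draw 80% of the rows at random; the rows that were not drawn form the test set.
training = dataset.sample(frac=0.8, random_state=1)
testing = dataset.drop(training.index)

# x is every column except the last, y is the last column.
x_train, y_train = training.iloc[:, :-1], training.iloc[:, -1]
x_test, y_test = testing.iloc[:, :-1], testing.iloc[:, -1]

print(len(training), len(testing))  # 40 10
```

Unlike a random boolean mask, sample(frac=0.8) gives a fixed number of training rows, and random_state makes the split reproducible.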
Upvotes: 0
Reputation: 644
You can try this.
Separating the target class from the rest:
pixel_values = dataset[dataset.columns[:-1]]  # every column except the last
target_class = dataset[dataset.columns[-1]]   # the last column
Now to create test and training samples:
I would just use numpy's rand:
import numpy as np
mask = np.random.rand(len(pixel_values)) < 0.8  # True for roughly 80% of the rows
train = pixel_values[mask]
test = pixel_values[~mask]
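Now you have training and test samples in train and test at roughly an 80:20 ratio (the mask is random, so the proportion is only approximate). Apply the same mask to target_class to get the matching labels.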
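To show how one mask can split the features and the labels together, here is a small self-contained sketch. The DataFrame is synthetic stand-in data and the variable names x_train, y_train, x_test and y_test are just illustrative choices:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the headerless CSV: 4 "pixel" columns plus a label column.
rng = np.random.default_rng(0)
dataset = pd.DataFrame(rng.integers(0, 256, size=(200, 5)))

pixel_values = dataset.iloc[:, :-1]  # every column except the last
target_class = dataset.iloc[:, -1]   # the last column

# One boolean mask, reused for features and labels so the rows stay paired.
mask = rng.random(len(dataset)) < 0.8

x_train, y_train = pixel_values[mask], target_class[mask]
x_test, y_test = pixel_values[~mask], target_class[~mask]

print(len(x_train), len(x_test))  # the two sizes add up to 200
```

Each row lands in the training set independently with probability 0.8, so the set sizes vary from run to run unless the generator is seeded.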
Upvotes: 1