Reputation: 318
How to divide a given dataset into train and test sets along with their correct labels.
There is an implementation for same through sklearn library :
from sklearn.cross_validation import train_test_split
train, test = train_test_split(df, test_size = 0.2)
where df is the original dataset....for eg : a list of strings
The problem is that it doesnt take the target/labels along with the data sets. So we cannot track which label belongs to what data point...
Is there any way to bind data points and their labels and then split the data sets into train and test?
Upvotes: 3
Views: 641
Reputation: 76297
sklearn.cross_validation.train_test_split
essentially takes a variable number of arrays which it will split
*arrays : sequence of arrays or scipy.sparse matrices with same shape[0]
Returns:
splitting : list of arrays, length=2 * len(arrays) List containing train-test split of input array.
so you can just add along the labels list:
from sklearn import cross_validation
df = ['the', 'quick', 'brown', 'fox']
labels = [0, 1, 0, 0]
>> cross_validation.train_test_split(df, labels, test_size=0.2)
[['quick', 'fox', 'the'], ['brown'], [1, 0, 0], [0]]
Upvotes: 4