mach

Reputation: 318

Machine Learning: dividing data into test and train sets

How can I divide a given dataset into train and test sets while keeping each data point paired with its correct label?

The sklearn library has an implementation of this:

from sklearn.cross_validation import train_test_split

train, test = train_test_split(df, test_size = 0.2)

where df is the original dataset, e.g. a list of strings.

The problem is that it doesn't take the targets/labels along with the dataset, so we cannot track which label belongs to which data point.

Is there any way to bind data points to their labels and then split the dataset into train and test sets?

Upvotes: 3

Views: 641

Answers (1)

Ami Tavory

Reputation: 76297

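sklearn.cross_validation.train_test_split takes a variable number of arrays and splits all of them along the same indices (in scikit-learn 0.18 and later, the same function lives in sklearn.model_selection). From its documentation: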

*arrays : sequence of arrays or scipy.sparse matrices with same shape[0]

Returns:
splitting : list of arrays, length=2 * len(arrays)
    List containing train-test split of input array.

So you can simply pass the labels list along with the data:

from sklearn import cross_validation

df = ['the', 'quick', 'brown', 'fox']
labels = [0, 1, 0, 0]

>>> cross_validation.train_test_split(df, labels, test_size=0.2)
[['quick', 'fox', 'the'], ['brown'], [1, 0, 0], [0]]
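
The returned list is ordered as train data, test data, train labels, test labels, so it can be unpacked directly. Here is a minimal sketch of the same call, assuming a newer scikit-learn where the function is imported from sklearn.model_selection. The names data, X_train, X_test, y_train, y_test and label_of, and the random_state value, are illustrative choices and not part of the original example:

```python
from sklearn.model_selection import train_test_split

data = ['the', 'quick', 'brown', 'fox']
labels = [0, 1, 0, 0]

# Both sequences are split along the same shuffled indices,
# so each data point stays paired with its label.
# random_state is fixed here only to make the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    data, labels, test_size=0.2, random_state=0
)

# Sanity check: every point still carries its original label.
label_of = dict(zip(data, labels))
pairs_ok = (
    all(label_of[x] == y for x, y in zip(X_train, y_train))
    and all(label_of[x] == y for x, y in zip(X_test, y_test))
)
print(pairs_ok)
```

With four samples and test_size=0.2, one sample goes to the test set and three to the train set. Which sample ends up in which set depends on the shuffle.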

Upvotes: 4
