chas

Reputation: 1645

split into train and test by group + sklearn cross_val_score

I have a dataframe in python as shown below:

data    labels    group
 aa       1         x
 bb       1         x
 cc       2         y
 dd       1         y
 ee       3         y
 ff       3         x
 gg       3         z
 hh       1         z
 ii       2         z

It is straightforward to randomly split into 70:30 training and test sets. Here, I need to split so that 70% of the data within each group goes into the training set and 30% of the data within each group into the test set, then predict and find the accuracy of the test data within each group.

I find that cross_val_score does the splitting, model fitting and predicting with the function below:

>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.model_selection import cross_val_score
>>> model = LogisticRegression(random_state=0)
>>> scores = cross_val_score(model, data, labels, cv=5)
>>> scores

The documentation of cross_val_score has a groups parameter, which says:

groups : array-like, with shape (n_samples,), optional
Group labels for the samples used while splitting the dataset into 
train/test set.

As above, I need 70% of the data within each group in training and 30% within each group as test data, then to predict and find the accuracy of the test data within each group. Does using the groups parameter in the way below split the data within each group into training and test sets and make the predictions?

>>> scores = cross_val_score(model, data, labels, groups=group, cv=5)

Any help is appreciated.

Upvotes: 2

Views: 4700

Answers (3)

Avi

Reputation: 454

To specify your train and validation sets this way, you will need to create a cross-validation object rather than use the cv=5 argument to cross_val_score. The trick is that you want to stratify the folds, but based on another column of the data rather than on the classes in y. I think you can use StratifiedShuffleSplit for this, like the following.

import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4],
              [1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1])

groups_to_stratify = np.array([1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3])

sss = StratifiedShuffleSplit(n_splits=5, test_size=0.3, random_state=0)
sss.get_n_splits()

print(sss)       

# Note groups_to_stratify is used in the split() function not y as usual
for train_index, test_index in sss.split(X, groups_to_stratify):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    print("TRAIN indices:", train_index, 
          "train groups", groups_to_stratify[train_index],
          "TEST indices:", test_index, 
          "test groups", groups_to_stratify[test_index])
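Once a split is obtained this way, the per-group test accuracy the question asks about can be computed by hand from one split. A minimal sketch using the same toy arrays; the choice of LogisticRegression as the model is just an assumption carried over from the question:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedShuffleSplit

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4],
              [1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1])
groups = np.array([1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3])

# Stratify on the group column so each group keeps a ~70:30 split
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
train_index, test_index = next(sss.split(X, groups))

model = LogisticRegression(random_state=0).fit(X[train_index], y[train_index])
pred = model.predict(X[test_index])

# Accuracy of the test predictions, computed separately within each group
for g in np.unique(groups[test_index]):
    mask = groups[test_index] == g
    print(g, (pred[mask] == y[test_index][mask]).mean())
```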

Upvotes: 0

Franco Piccolo

Reputation: 7410

There is no way that I know of to do this straight from the function, but you could apply train_test_split per group and then concatenate the splits with pd.concat, like:

import pandas as pd
from sklearn.model_selection import train_test_split

def train_test_split_group(x):
    X_train, X_test, y_train, y_test = train_test_split(x['data'], x['labels'])
    return pd.Series([X_train, X_test, y_train, y_test],
                     index=['X_train', 'X_test', 'y_train', 'y_test'])

final = df.groupby('group').apply(train_test_split_group).apply(lambda x: pd.concat(x.tolist()))
final['X_train'].dropna()

1    bb
3    dd
4    ee
5    ff
6    gg
7    hh
Name: X_train, dtype: object

Upvotes: 1

G. Anderson

Reputation: 5955

The stratify parameter of train_test_split takes the labels on which to stratify the selection to maintain proper class balance.

X_train, X_test, y_train, y_test = train_test_split(df['data'], df['labels'], stratify=df['group'])

On your toy dataset this seems to be what you want, but I would try it on your full dataset and verify that the groups are balanced by checking the counts of data in your train and test sets.
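That check could look like the following sketch, built on the toy dataframe from the question; the explicit test_size and random_state values are just assumptions for reproducibility:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    'data':   ['aa', 'bb', 'cc', 'dd', 'ee', 'ff', 'gg', 'hh', 'ii'],
    'labels': [1, 1, 2, 1, 3, 3, 3, 1, 2],
    'group':  ['x', 'x', 'y', 'y', 'y', 'x', 'z', 'z', 'z'],
})

X_train, X_test, y_train, y_test = train_test_split(
    df['data'], df['labels'], test_size=0.3,
    stratify=df['group'], random_state=0)

# Count how many rows of each group landed in train vs. test;
# train_test_split preserves the original index, so we can look groups up.
print(df.loc[X_train.index, 'group'].value_counts())
print(df.loc[X_test.index, 'group'].value_counts())
```

With three groups of three rows each and a 70:30 split, each group should contribute two rows to train and one to test.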

Upvotes: 2
