Reputation: 1645
I have a dataframe in python as shown below:
data  labels  group
aa    1       x
bb    1       x
cc    2       y
dd    1       y
ee    3       y
ff    3       x
gg    3       z
hh    1       z
ii    2       z
It is straightforward to randomly split the data 70:30 into training and test sets. Here, however, I need the split to be done within each group: 70% of the data in each group should go into the training set and the remaining 30% into the test set, and then I want to predict and find the accuracy of the test data within each group.
I find that cross_val_score
does the splitting, model fitting and predicting with the function below:
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.model_selection import cross_val_score
>>> model = LogisticRegression(random_state=0)
>>> scores = cross_val_score(model, data, labels, cv=5)
>>> scores
The documentation of cross_val_score
has a groups
parameter which says:
groups : array-like, with shape (n_samples,), optional
Group labels for the samples used while splitting the dataset into
train/test set.
As described above, I need 70% of the data within each group in the training set and 30% within each group in the test set, and then to predict and find the accuracy of the test data within each group. Does using the groups parameter in the way below split the data within each group into training and test data and make the predictions?
>>> scores = cross_val_score(model, data, labels, groups= group, cv=5)
Any help is appreciated.
Upvotes: 2
Views: 4700
Reputation: 454
To specify your train and validation sets in this way you will need to create a cross-validation object rather than use the cv=5
argument to cross_val_score
. The trick is that you want to stratify the folds not on the classes in y
, but on another column of data. I think you can use StratifiedShuffleSplit
for this, like the following.
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4],
              [1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1])
groups_to_stratify = np.array([1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3])

sss = StratifiedShuffleSplit(n_splits=5, test_size=0.3, random_state=0)
sss.get_n_splits()
print(sss)

# Note groups_to_stratify is used in the split() function, not y as usual
for train_index, test_index in sss.split(X, groups_to_stratify):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    print("TRAIN indices:", train_index,
          "train groups:", groups_to_stratify[train_index],
          "TEST indices:", test_index,
          "test groups:", groups_to_stratify[test_index])
Upvotes: 0
Reputation: 7410
There is no way that I know of straight from the function, but you could apply
train_test_split
to each group and then concatenate the splits with pd.concat
like:
import pandas as pd
from sklearn.model_selection import train_test_split

def train_test_split_group(x):
    X_train, X_test, y_train, y_test = train_test_split(x['data'], x['labels'])
    return pd.Series([X_train, X_test, y_train, y_test], index=['X_train', 'X_test', 'y_train', 'y_test'])

final = df.groupby('group').apply(train_test_split_group).apply(lambda x: pd.concat(x.tolist()))
final['X_train'].dropna()
1 bb
3 dd
4 ee
5 ff
6 gg
7 hh
Name: X_train, dtype: object
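If you also need the per-group prediction and accuracy the question asks for, one possibility along the same groupby lines (a minimal sketch: the feature column here is hypothetical, since LogisticRegression needs numeric inputs rather than the string data column, and each group must contain enough rows of every label to fit) is:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Split, fit and score separately within each group
for name, grp in df.groupby('group'):
    X_train, X_test, y_train, y_test = train_test_split(
        grp[['feature']], grp['labels'], test_size=0.3)
    model = LogisticRegression().fit(X_train, y_train)
    print(name, "test accuracy:", model.score(X_test, y_test))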
Upvotes: 1
Reputation: 5955
The stratify
parameter of train_test_split
takes the labels on which to stratify the selection, to maintain proper class balance.
X_train, X_test, y_train, y_test = train_test_split(df['data'], df['labels'], stratify=df['group'])
On your toy dataset it seems to be what you want, but I would try it on your full dataset and verify that the classes are balanced by checking the counts of data in your train and test sets.
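For that check, one possibility (a minimal sketch, assuming the question's df; the group_train and group_test names are just introduced here for illustration) is to also split the group column and compare its counts in the two splits:

from sklearn.model_selection import train_test_split

# train_test_split can split several arrays at once, so pass the group column too
X_train, X_test, y_train, y_test, group_train, group_test = train_test_split(
    df['data'], df['labels'], df['group'], stratify=df['group'])
print(group_train.value_counts())
print(group_test.value_counts())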
Upvotes: 2