Reputation: 107
Let's say I have a DataFrame that looks something like the table below.
This is only an example; the real data has about 120,000 questions.
Question | Hint  | Cluster Label
q1       | q1_h1 | 1
q1       | q1_h2 | 1
q1       | q1_h3 | 1
q2       | q2_h1 | 2
q2       | q2_h2 | 2
q3       | q3_h1 | 1
q4       | q4_h1 | 2
q4       | q4_h2 | 2
I want to group by question and split the DataFrame into train and test sets so that each question stays together with all of its hints, with the split stratified on the cluster label. The output I need would be:
train:
Question | Hint  | Cluster Label
q1       | q1_h1 | 1
q1       | q1_h2 | 1
q1       | q1_h3 | 1
q2       | q2_h1 | 2
q2       | q2_h2 | 2
test:
Question | Hint  | Cluster Label
q3       | q3_h1 | 1
q4       | q4_h1 | 2
q4       | q4_h2 | 2
Upvotes: 0
Views: 140
Reputation: 698
It looks like you need GroupKFold or StratifiedGroupKFold.
From the user manual, GroupKFold "is a variation of k-fold which ensures that the same group is not represented in both testing and training sets."
To use it, call the constructor as normal:
from sklearn.model_selection import GroupKFold
gkf = GroupKFold(n_splits=5)
Then, when you call the split method of gkf, pass the column to group on (in your case 'Question') as the groups argument.
If you're using it in GridSearchCV or similar, pass the splitter as the cv argument and supply the groups through the groups argument of the fit call (not the GridSearchCV constructor). See previous answer here.
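As a rough sketch of the GridSearchCV usage, assuming a feature matrix and labels already exist: the random X, the alternating y, the LogisticRegression estimator, and the parameter grid below are all hypothetical placeholders chosen only to make the example run.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, GroupKFold

# Hypothetical toy data: 12 rows, 3 features, 4 questions with 3 hints each.
rng = np.random.default_rng(0)
X = rng.normal(size=(12, 3))
y = np.array([0, 1] * 6)
groups = np.repeat(["q1", "q2", "q3", "q4"], 3)

search = GridSearchCV(
    LogisticRegression(),
    param_grid={"C": [0.1, 1.0]},   # illustrative grid only
    cv=GroupKFold(n_splits=2),      # the splitter goes in the constructor
)

# The groups are passed to fit, which forwards them to the splitter.
search.fit(X, y, groups=groups)
print(search.best_params_)
```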
Upvotes: 0
Reputation: 5430
You can simply split the DataFrame according to the value in Hint:
# Compare the whole question prefix, so 'q1' does not also match 'q10', 'q11', ...
df_train = df[df['Hint'].str.split('_').str[0].isin(['q1', 'q2'])]
and similarly for df_test.
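A minimal sketch of the same manual split, keyed on the Question column instead of Hint, with the test set taken as the complement. The train_questions list is a hypothetical hand-picked choice; nothing here stratifies on the label automatically.

```python
import pandas as pd

# Toy data copied from the question's example table.
df = pd.DataFrame({
    "Question": ["q1", "q1", "q1", "q2", "q2", "q3", "q4", "q4"],
    "Hint": ["q1_h1", "q1_h2", "q1_h3", "q2_h1", "q2_h2",
             "q3_h1", "q4_h1", "q4_h2"],
    "Cluster Label": [1, 1, 1, 2, 2, 1, 2, 2],
})

# Hypothetical, hand-picked list of questions for the training set.
train_questions = ["q1", "q2"]

# Exact matching on Question avoids substring collisions such as 'q1' vs 'q10'.
is_train = df["Question"].isin(train_questions)
df_train = df[is_train]
df_test = df[~is_train]   # everything not in train goes to test
```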
Upvotes: 1