HelloWorld

Reputation: 107

How to split data into train and test using groupby column

Let's say I have a dataframe that looks something like the one below.
This table is only an example; the real data has about 120,000 questions.

Question | Hint  | Cluster Label
q1       | q1_h1 | 1
q1       | q1_h2 | 1
q1       | q1_h3 | 1
q2       | q2_h1 | 2
q2       | q2_h2 | 2
q3       | q3_h1 | 1
q4       | q4_h1 | 2
q4       | q4_h2 | 2

I want to group by Question and split the dataframe into train and test sets so that each question and all of its hints end up in the same set, with the split stratified on Cluster Label. The output I need would be:

train:
Question | Hint  | Cluster Label
q1       | q1_h1 | 1
q1       | q1_h2 | 1
q1       | q1_h3 | 1
q2       | q2_h1 | 2
q2       | q2_h2 | 2

test:
Question | Hint  | Cluster Label
q3       | q3_h1 | 1
q4       | q4_h1 | 2
q4       | q4_h2 | 2

Upvotes: 0

Views: 140

Answers (2)

njp

Reputation: 698

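Looks like you need scikit-learn's GroupKFold or StratifiedGroupKFold (both in sklearn.model_selection). Since you also want the split stratified on the label, StratifiedGroupKFold is the closer match.

From the scikit-learn user guide, GroupKFold "is a variation of k-fold which ensures that the same group is not represented in both testing and training sets."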

To use it, you call the constructor as normal:

from sklearn.model_selection import GroupKFold
gkf = GroupKFold(n_splits=5)

Then, when you call the split method of gkf, pass the column to group on (in your case 'Question') as the groups argument.

If you're using it in GridSearchCV or similar, pass the splitter as the cv argument and supply the grouping values through the 'groups' argument when you call fit. See previous answer here.
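A minimal sketch of the GridSearchCV case follows. The feature matrix, labels, group ids, estimator and parameter grid are all made-up placeholders (the question does not say which model or features are used); the only point is where the splitter and the groups go.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, GroupKFold

# Synthetic placeholder data, not from the question:
# 40 rows, 3 features, 8 groups of 5 consecutive rows each.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = np.tile([0, 1], 20)
groups = np.repeat(np.arange(8), 5)

search = GridSearchCV(
    LogisticRegression(),
    param_grid={"C": [0.1, 1.0]},   # hypothetical grid
    cv=GroupKFold(n_splits=4),      # the group-aware splitter goes in cv
)

# The groups are passed to fit, which forwards them to the splitter.
search.fit(X, y, groups=groups)
print(search.best_params_)
```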

Upvotes: 0

user19077881

Reputation: 5430

You can simply split the DataFrame according to the value in Hint:

# startswith('q1_') avoids also matching q10, q11, ... in the full data
df_train = df[df['Hint'].str.startswith('q1_') | df['Hint'].str.startswith('q2_')]

and similarly for df_test
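For completeness, here is one way the manual split could be written for both halves at once. Note that this sketch filters on the Question column with isin rather than on substrings of Hint, which is a swapped-in variant and not the exact expression above; the list train_questions is a hypothetical hand-picked choice.

```python
import pandas as pd

# Toy data copied from the question.
df = pd.DataFrame({
    "Question": ["q1", "q1", "q1", "q2", "q2", "q3", "q4", "q4"],
    "Hint": ["q1_h1", "q1_h2", "q1_h3", "q2_h1", "q2_h2",
             "q3_h1", "q4_h1", "q4_h2"],
    "Cluster Label": [1, 1, 1, 2, 2, 1, 2, 2],
})

# Hypothetical, hand-picked list of questions for the training set.
train_questions = ["q1", "q2"]

# Boolean mask: True for rows whose question belongs to the training set.
in_train = df["Question"].isin(train_questions)

df_train = df[in_train]
df_test = df[~in_train]   # everything else goes to the test set

print(df_train)
print(df_test)
```

This keeps every question's hints together, but the choice of questions (and therefore the label balance) is manual, so it does not scale well to a large number of questions.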

Upvotes: 1
