Reputation: 340
I have a data frame like the one below, and I need to split it into training and test sets so that any ID that appears in the training set does not also appear in the test set.
Row ID AGE GENDER TIME CODE
0 1 66 M 1 0
1 1 66 M 2 0
2 1 66 M 3 1
3 2 20 F 1 0
4 2 20 F 2 0
5 2 20 F 3 0
6 2 20 F 4 0
7 3 18 F 1 0
8 3 18 F 2 0
9 3 18 F 3 0
10 3 18 F 4 1
The desired training set should look like this:
Row ID AGE GENDER TIME CODE
0 1 66 M 1 0
1 1 66 M 2 0
2 1 66 M 3 1
3 2 20 F 1 0
4 2 20 F 2 0
5 2 20 F 3 0
6 2 20 F 4 0
and the test set should look like this:
Row ID AGE GENDER TIME CODE
0 3 18 F 1 0
1 3 18 F 2 0
2 3 18 F 3 0
3 3 18 F 4 1
How can I do this with pandas in Python?
Thanks in advance
Upvotes: 0
Views: 145
Reputation: 11192
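Try this. It sends the first 60% of the unique IDs to the training set and the remaining IDs to the test set, so all rows of a given ID stay together:
# unique IDs, in order of first appearance
ids = df['ID'].unique()
# the first 60% of the IDs (rounded) are the training IDs
t = ids[:int(round(len(ids) * 0.60))]
# rows whose ID is a training ID go to train, all other rows go to test
train = df[df['ID'].isin(t)]
test = df[~df['ID'].isin(t)]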
Input:
Row ID AGE GENDER TIME CODE
0 0 1 66 M 1 0
1 1 1 66 M 2 0
2 2 1 66 M 3 1
3 3 2 20 F 1 0
4 4 2 20 F 2 0
5 5 2 20 F 3 0
6 6 2 20 F 4 0
7 7 3 18 F 1 0
8 8 3 18 F 2 0
9 9 3 18 F 3 0
10 10 3 18 F 4 1
Output:
Train:
Row ID AGE GENDER TIME CODE
0 0 1 66 M 1 0
1 1 1 66 M 2 0
2 2 1 66 M 3 1
3 3 2 20 F 1 0
4 4 2 20 F 2 0
5 5 2 20 F 3 0
6 6 2 20 F 4 0
Test:
Row ID AGE GENDER TIME CODE
7 7 3 18 F 1 0
8 8 3 18 F 2 0
9 9 3 18 F 3 0
10 10 3 18 F 4 1
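If you would rather pick the training IDs at random instead of taking the first ones in order, something along these lines might work. This is only a sketch: `split_by_id`, `train_frac` and `seed` are names made up for illustration, and the sample frame is just the data from the question retyped. It shuffles the unique IDs rather than the rows, so an ID can never end up on both sides, and it resets the index of each part so the test set starts at 0 as in the desired output.

```python
import numpy as np
import pandas as pd

# Sample data retyped from the question (the Row column is left out)
df = pd.DataFrame({
    "ID":     [1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
    "AGE":    [66, 66, 66, 20, 20, 20, 20, 18, 18, 18, 18],
    "GENDER": ["M", "M", "M", "F", "F", "F", "F", "F", "F", "F", "F"],
    "TIME":   [1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4],
    "CODE":   [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1],
})


def split_by_id(frame, id_col="ID", train_frac=0.60, seed=0):
    """Hypothetical helper: split rows so that no ID lands in both sets."""
    ids = frame[id_col].unique()
    # Shuffle the IDs (not the rows) so the training IDs are chosen at random
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(ids)
    # Same rounding rule as the snippet above: 60% of the IDs by default
    n_train = int(round(len(shuffled) * train_frac))
    in_train = frame[id_col].isin(shuffled[:n_train])
    # reset_index gives each part its own 0-based index
    train = frame[in_train].reset_index(drop=True)
    test = frame[~in_train].reset_index(drop=True)
    return train, test


train, test = split_by_id(df)
```

Which IDs land in the training set depends on `seed`. With three IDs and `train_frac=0.60`, two IDs go to train and one to test.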
Upvotes: 1