Reputation: 21450
I'm trying to do some supervised machine learning on a data set.
My data is organized in a single DataFrame with samples as rows and features as columns. One of my columns contains the category to which the sample belongs.
I would like to split my data set in half such that samples are evenly distributed between categories. Is there a native pandas approach to doing so, or will I have to loop through each row and individually assign each sample to either the training or the testing group?
Here is an illustrative example of how my data is organized. The char column indicates the category to which each row belongs.
feature char
0 SimpleCV.Features.Blob.Blob object at (38, 74)... A
1 SimpleCV.Features.Blob.Blob object at (284, 26... A
2 SimpleCV.Features.Blob.Blob object at (87, 123... B
3 SimpleCV.Features.Blob.Blob object at (198, 37... B
4 SimpleCV.Features.Blob.Blob object at (345, 60... C
5 SimpleCV.Features.Blob.Blob object at (139, 92... C
6 SimpleCV.Features.Blob.Blob object at (167, 83... D
7 SimpleCV.Features.Blob.Blob object at (57, 54)... D
8 SimpleCV.Features.Blob.Blob object at (35, 77)... E
9 SimpleCV.Features.Blob.Blob object at (136, 73... E
Referring to the above example, I'd like to end up with two DataFrames, each containing half of the samples in each char category. In this example, there are two rows of each char type, so the resulting DataFrames would each have one A row, one B row, etc.
I should mention, however, that the number of rows in each char category in my actual data can vary.
Thanks very much in advance!
Upvotes: 3
Views: 1058
Reputation: 251448
Here is one way:
>>> print(d)
A B Cat
0 -1.703752 0.659098 X
1 0.418694 0.507111 X
2 0.385922 1.055286 Y
3 -0.909748 -0.900903 Y
4 -0.845475 1.681000 Y
5 1.257767 2.465161 Y
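>>> def whichHalf(t):
...     t = t.copy()  # work on a copy of the group
...     t['Div'] = 'Train'
...     # label the first half (rounded down) of the group as test rows
...     t.iloc[:len(t) // 2, t.columns.get_loc('Div')] = 'Test'
...     return t
>>> d.groupby('Cat', group_keys=False).apply(whichHalf)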
A B Cat Div
0 -1.703752 0.659098 X Test
1 0.418694 0.507111 X Train
2 0.385922 1.055286 Y Test
3 -0.909748 -0.900903 Y Test
4 -0.845475 1.681000 Y Train
5 1.257767 2.465161 Y Train
This assigns the first half of each group to the test set and the second half to the training set. You can then get the two sets by filtering on this new "Div" column. Note that the split is exactly even only if each category has an even number of data points. A category with an odd number of rows can't be divided equally, so with this code its extra row ends up in the training set.
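As a rough sketch of the filtering step, here is a self-contained version that builds the same "Div" labels without a helper function and then pulls out the two sets. The data values are made up to resemble the example above, and the names pos, size, is_test, test and train are just illustrative. It uses groupby(...).cumcount() in place of the apply call, so treat it as one possible variant rather than the only way to do this.

```python
import numpy as np
import pandas as pd

# Made-up data shaped like the example above (values are illustrative only).
d = pd.DataFrame({
    "A": [-1.7, 0.4, 0.4, -0.9, -0.8, 1.3],
    "B": [0.7, 0.5, 1.1, -0.9, 1.7, 2.5],
    "Cat": ["X", "X", "Y", "Y", "Y", "Y"],
})

# Position of each row within its own category: 0, 1, 2, ...
pos = d.groupby("Cat").cumcount()
# Number of rows in the category each row belongs to.
size = d["Cat"].map(d["Cat"].value_counts())

# The first half (rounded down) of each category is labelled as test data.
is_test = pos < size // 2
d["Div"] = np.where(is_test, "Test", "Train")

# Filter on the new column to get the two sets.
test = d[d["Div"] == "Test"]
train = d[d["Div"] == "Train"]
```

Like the version above, this takes rows in their existing order. If the rows should be assigned at random, shuffling first with d.sample(frac=1) before computing the labels is one option.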
Upvotes: 3