How can I make a DataFrame containing half of the data from another DataFrame, distributed evenly across values in a column?

Question

I'm trying to do some supervised machine learning on a data set.

My data is organized in a single DataFrame with samples as rows and features as columns. One of my columns contains the category to which the sample belongs.

I would like to split my data set in half such that samples are evenly distributed between categories. Is there a native pandas approach to doing so, or will I have to loop through each row and individually assign each sample to either the training or the testing group?

Here is an illustrative example of how my data is organized. The char column indicates the category to which each row belongs.

                                              feature char
0   SimpleCV.Features.Blob.Blob object at (38, 74)...    A
1   SimpleCV.Features.Blob.Blob object at (284, 26...    A
2   SimpleCV.Features.Blob.Blob object at (87, 123...    B
3   SimpleCV.Features.Blob.Blob object at (198, 37...    B
4   SimpleCV.Features.Blob.Blob object at (345, 60...    C
5   SimpleCV.Features.Blob.Blob object at (139, 92...    C
6   SimpleCV.Features.Blob.Blob object at (167, 83...    D
7   SimpleCV.Features.Blob.Blob object at (57, 54)...    D
8   SimpleCV.Features.Blob.Blob object at (35, 77)...    E
9   SimpleCV.Features.Blob.Blob object at (136, 73...    E

Referring to the above example, I'd like to end up with two DataFrames, each containing half of the samples in each char category. In this example, there are two rows of each char type, so the resulting DataFrames would each have one A row, one B row, etc.

I should mention, however, that the number of rows in each char category in my actual data can vary.

Thanks very much in advance!

BrenBarn · Accepted Answer

Here is one way:

>>> print(d)
          A         B Cat
0 -1.703752  0.659098   X
1  0.418694  0.507111   X
2  0.385922  1.055286   Y
3 -0.909748 -0.900903   Y
4 -0.845475  1.681000   Y
5  1.257767  2.465161   Y
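>>> def whichHalf(t):
...     t = t.copy()  # work on a copy of the group
...     t['Div'] = 'Train'
...     # label the first half by position; // keeps the slice bound an integer
...     t.iloc[:len(t) // 2, t.columns.get_loc('Div')] = 'Test'
...     return t
>>> d.groupby('Cat', group_keys=False).apply(whichHalf)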
          A         B Cat    Div
0 -1.703752  0.659098   X   Test
1  0.418694  0.507111   X  Train
2  0.385922  1.055286   Y   Test
3 -0.909748 -0.900903   Y   Test
4 -0.845475  1.681000   Y  Train
5  1.257767  2.465161   Y  Train

This assigns the first half of each group to the test set and the second half to the training set. You can then get the two sets by filtering on the new "Div" column. Note that the split is exactly even only if each category has an even number of data points. If a category has an odd number of data points, you can't divide it equally into two parts: the first half is rounded down, so the training set gets the extra row.
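If you'd rather skip the intermediate "Div" column, here is a rough sketch of the same first-half/second-half idea that returns the two DataFrames directly. It is not the code from the answer above: it swaps `apply` for `groupby(...).cumcount()` and `transform("size")`. The helper name `split_by_category` and the sample values are made up for illustration, and the extra category Z has a single row to show the odd-count case.

```python
import pandas as pd

# Hypothetical data shaped like the example above (values are made up).
d = pd.DataFrame({
    "A": [-1.7, 0.4, 0.4, -0.9, -0.8, 1.3, 0.2],
    "B": [0.7, 0.5, 1.1, -0.9, 1.7, 2.5, 0.3],
    "Cat": ["X", "X", "Y", "Y", "Y", "Y", "Z"],
})

def split_by_category(df, col):
    """Return (test, train), taking the first half of each category as test."""
    grouped = df.groupby(col)
    position = grouped.cumcount()          # 0, 1, 2, ... within each category
    size = grouped[col].transform("size")  # row count of each row's category
    is_test = position < size // 2         # odd-sized groups round down
    return df[is_test], df[~is_test]

test, train = split_by_category(d, "Cat")
print(test)
print(train)
```

Here the single Z row ends up in `train`, since half of one rounds down to zero. Like the answer above, this takes rows in their existing order, so shuffle first if the order within each category is meaningful.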
