user3062260

Reputation: 1644

split data into training and test with pandas with respect to observation name

I would like to split my dataframe into training and test data. There is a great post here on how to do this randomly. However, I need to split it based on the names of the observations, to make sure that (for instance) 2/3 of the observations with sample name 'X' are allocated to the training data and 1/3 of the observations with sample name 'X' are allocated to the test data.

Here is the top of my DF:

             136       137       138       139  141  143  144  145       146  \
Sample                                                                         
HC10    0.000000  0.000000  0.000000  0.000000  0.0  0.0  0.0  0.0  0.140901   
HC10    0.000000  0.000000  0.000000  0.267913  0.0  0.0  0.0  0.0  0.000000   
HC10    0.000000  0.000000  0.000000  0.000000  0.0  0.0  0.0  0.0  0.174445   
HC11    0.059915  0.212442  0.255549  0.000000  0.0  0.0  0.0  0.0  0.000000   
HC11    0.000000  0.115988  0.144056  0.070028  0.0  0.0  0.0  0.0  0.000000   

        147       148  149       150  151       152      154       156  158  \
Sample                                                                        
HC10    0.0  0.189937  0.0  0.052635  0.0  0.148751  0.00000  0.000000  0.0   
HC10    0.0  0.000000  0.0  0.267764  0.0  0.000000  0.00000  0.000000  0.0   
HC10    0.0  0.208134  0.0  0.130212  0.0  0.165507  0.00000  0.000000  0.0   
HC11    0.0  0.000000  0.0  0.000000  0.0  0.000000  0.06991  0.102209  0.0   
HC11    0.0  0.065779  0.0  0.072278  0.0  0.060815  0.00000  0.060494  0.0   

             160  173  
Sample                 
HC10    0.051911  0.0  
HC10    0.281227  0.0  
HC10    0.000000  0.0  
HC11    0.000000  0.0  
HC11    0.073956  0.0

Sample is the index of the dataframe; the rest is numerical.

If I use a solution such as:

train = df.sample(frac=0.8, random_state=200)
test = df.drop(train.index)

as was suggested here, then a sample such as HC10 may be allocated entirely to the training data, leaving me no rows of it to test my model on. Does anyone know a quick way (ideally using pandas) to partition the data in this way?

Many thanks

Upvotes: 4

Views: 3136

Answers (1)

David Dale

Reputation: 11424

You can do the sampling group-wise, to keep each group balanced. I will demonstrate on a small example:

import pandas as pd
df = pd.DataFrame({
    'group': ['a', 'a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
    'x': range(10)
})

train = df.reset_index(                  # keep the original index as a column
    ).groupby('group'                    # split by "group"
    ).apply(lambda x: x.sample(frac=0.6) # in each group, do the random split
    ).reset_index(drop=True              # drop the (group, row) MultiIndex created by apply
    ).set_index('index')                 # restore the original index
test = df.drop(train.index)              # the test set is everything not drawn into train
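
Applied to the frame from the question, where 'Sample' is a non-unique index, the same recipe might look like the sketch below; the 2/3 fraction and random_state=200 are taken from the question, and group_keys=False (standard pandas) keeps the original row labels so the drop works:

tmp = df.reset_index()                      # unique row ids; 'Sample' becomes a column
train = tmp.groupby('Sample', group_keys=False).apply(
    lambda g: g.sample(frac=2/3, random_state=200)  # 2/3 of each sample name
)
test = tmp.drop(train.index)                # the remaining 1/3 of each sample name
train = train.set_index('Sample')           # restore 'Sample' as the index
test = test.set_index('Sample')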

Another solution would be to use stratified sampling algorithms, e.g. from scikit-learn.
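
A minimal sketch of that alternative, assuming the question's df with 'Sample' as its index (train_test_split and its stratify argument are scikit-learn API; note that stratification needs at least two rows per sample name):

from sklearn.model_selection import train_test_split

# Stratify on the index labels so each sample name keeps roughly
# a 2/3 train / 1/3 test proportion; random_state is arbitrary.
train, test = train_test_split(
    df,
    test_size=1/3,
    stratify=df.index,
    random_state=200,
)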

Upvotes: 3
