Reputation: 1644
I would like to split my dataframe into training and test data. There is a great post here on how to do this randomly. However, I need to split it based on the names of the observations, so that (for instance) 2/3 of the observations with sample name 'X' are allocated to the training data and 1/3 of the observations with sample name 'X' are allocated to the test data.
Here is the top of my DF:
136 137 138 139 141 143 144 145 146 \
Sample
HC10 0.000000 0.000000 0.000000 0.000000 0.0 0.0 0.0 0.0 0.140901
HC10 0.000000 0.000000 0.000000 0.267913 0.0 0.0 0.0 0.0 0.000000
HC10 0.000000 0.000000 0.000000 0.000000 0.0 0.0 0.0 0.0 0.174445
HC11 0.059915 0.212442 0.255549 0.000000 0.0 0.0 0.0 0.0 0.000000
HC11 0.000000 0.115988 0.144056 0.070028 0.0 0.0 0.0 0.0 0.000000
147 148 149 150 151 152 154 156 158 \
Sample
HC10 0.0 0.189937 0.0 0.052635 0.0 0.148751 0.00000 0.000000 0.0
HC10 0.0 0.000000 0.0 0.267764 0.0 0.000000 0.00000 0.000000 0.0
HC10 0.0 0.208134 0.0 0.130212 0.0 0.165507 0.00000 0.000000 0.0
HC11 0.0 0.000000 0.0 0.000000 0.0 0.000000 0.06991 0.102209 0.0
HC11 0.0 0.065779 0.0 0.072278 0.0 0.060815 0.00000 0.060494 0.0
160 173
Sample
HC10 0.051911 0.0
HC10 0.281227 0.0
HC10 0.000000 0.0
HC11 0.000000 0.0
HC11 0.073956 0.0
Sample is the index of the dataframe, the rest is numerical.
If I use a solution such as:
train=df.sample(frac=0.8,random_state=200)
test=df.drop(train.index)
as was suggested here, then samples such as HC10 in my df may all be allocated to the training data, and I will not be able to test my model on them. Does anyone know a quick way (ideally using pandas) to partition the data like this?
Many thanks
Upvotes: 4
Views: 3136
Reputation: 11424
You can do the sampling group-wise, to keep each group balanced. I will use a small example in the same spirit as yours:
import pandas as pd
df = pd.DataFrame({
    'group': ['a', 'a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
    'x': range(10)
})
train = df.reset_index(               # keep the original index as a column
).groupby('group'                     # split by "group"
).apply(lambda g: g.sample(frac=0.6)  # do the random split within each group
).reset_index(drop=True               # drop the group-id index added by apply
).set_index('index')                  # restore the original index
test = df.drop(train.index)           # the remaining rows form the test set
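As a side note, on newer pandas (1.1+) the same per-group draw can be done directly with GroupBy.sample, which keeps the original index, so no reset/restore dance is needed. A minimal sketch on the same toy data:

```python
import pandas as pd

df = pd.DataFrame({
    'group': ['a', 'a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
    'x': range(10)
})

# sample 60% of the rows within each group; the original index is preserved
train = df.groupby('group').sample(frac=0.6, random_state=0)
test = df.drop(train.index)
```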
Another solution would be a stratified split, e.g. using scikit-learn and stratifying on the group label so each group keeps roughly the same train/test ratio.
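A minimal sketch of the scikit-learn route, assuming the same toy data with a 'group' column (column and variable names are illustrative): train_test_split with stratify= draws the split so that the train/test ratio is preserved within each group.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    'group': ['a', 'a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
    'x': range(10)
})

# stratify= keeps the 60/40 ratio (approximately) within each group,
# so no group ends up entirely in the training set
train, test = train_test_split(
    df, test_size=0.4, stratify=df['group'], random_state=0
)
```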
Upvotes: 3