Reputation: 1644
I would like to split my dataframe into training and test data. There is a great post here on how to do this randomly. However, I need to split it based on the names of the observations, so that (for instance) 2/3 of the observations with sample name 'X' are allocated to the training data and 1/3 of the observations with sample name 'X' are allocated to the test data.
Here is the top of my DF:
136 137 138 139 141 143 144 145 146 \
Sample
HC10 0.000000 0.000000 0.000000 0.000000 0.0 0.0 0.0 0.0 0.140901
HC10 0.000000 0.000000 0.000000 0.267913 0.0 0.0 0.0 0.0 0.000000
HC10 0.000000 0.000000 0.000000 0.000000 0.0 0.0 0.0 0.0 0.174445
HC11 0.059915 0.212442 0.255549 0.000000 0.0 0.0 0.0 0.0 0.000000
HC11 0.000000 0.115988 0.144056 0.070028 0.0 0.0 0.0 0.0 0.000000
147 148 149 150 151 152 154 156 158 \
Sample
HC10 0.0 0.189937 0.0 0.052635 0.0 0.148751 0.00000 0.000000 0.0
HC10 0.0 0.000000 0.0 0.267764 0.0 0.000000 0.00000 0.000000 0.0
HC10 0.0 0.208134 0.0 0.130212 0.0 0.165507 0.00000 0.000000 0.0
HC11 0.0 0.000000 0.0 0.000000 0.0 0.000000 0.06991 0.102209 0.0
HC11 0.0 0.065779 0.0 0.072278 0.0 0.060815 0.00000 0.060494 0.0
160 173
Sample
HC10 0.051911 0.0
HC10 0.281227 0.0
HC10 0.000000 0.0
HC11 0.000000 0.0
HC11 0.073956 0.0
Sample is the index of the dataframe, the rest is numerical.
If I use a solution such as:
train=df.sample(frac=0.8,random_state=200)
test=df.drop(train.index)
as was suggested here, then samples such as HC10 in my df may all be allocated to the training data, and I will not be able to test my model on them. Does anyone know a quick way (ideally using pandas) to partition the data like this?
Many thanks
Upvotes: 4
Views: 3136
Reputation: 11424
You can do the sampling group-wise, to keep each group balanced. I will use a small example in the same spirit as yours:
import pandas as pd
df = pd.DataFrame({
    'group': ['a', 'a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
    'x': range(10)
})
train = df.reset_index(               # keep the original index as a column
).groupby('group'                     # split by "group"
).apply(lambda g: g.sample(frac=0.6)  # do the random split within each group
).reset_index(drop=True               # drop the group-id index added by apply
).set_index('index')                  # restore the original index
test = df.drop(train.index)           # the remaining rows form the test set
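As a side note, on newer pandas (1.1+) the same per-group draw can be done directly with GroupBy.sample, which keeps the original index, so no reset/restore dance is needed. A minimal sketch on the same toy data:

```python
import pandas as pd

df = pd.DataFrame({
    'group': ['a', 'a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
    'x': range(10)
})

# sample 60% of the rows within each group; the original index is preserved
train = df.groupby('group').sample(frac=0.6, random_state=0)
test = df.drop(train.index)
```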
Another solution would be a stratified split, e.g. using scikit-learn and stratifying on the group label so each group keeps roughly the same train/test ratio.
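A minimal sketch of the scikit-learn route, assuming the same toy data with a 'group' column (column and variable names are illustrative): train_test_split with stratify= draws the split so that the train/test ratio is preserved within each group.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    'group': ['a', 'a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
    'x': range(10)
})

# stratify= keeps the 60/40 ratio (approximately) within each group,
# so no group ends up entirely in the training set
train, test = train_test_split(
    df, test_size=0.4, stratify=df['group'], random_state=0
)
```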
Upvotes: 3