Reputation: 478
I have a data set of subjects, and each subject has a number of rows in my pandas dataframe (each measurement is a row, and a subject may have been measured several times). I would like to split my data into training and test sets, but I cannot split the rows randomly because all of a subject's measurements are dependent (the same subject must not appear in both the train and test sets). How would you resolve this? Each subject has a different number of measurements.
Edit: My data includes the subject number for each row and I would like to split as close to 0.8/0.2 as possible.
Upvotes: 4
Views: 2235
Reputation: 294338
Consider the dataframe df with column user_id to identify users.
import numpy as np
import pandas as pd
# Toy data: 100 rows, user_id drawn at random from 0..4.
df = pd.DataFrame(
    np.random.randint(5, size=(100, 4)), columns=['user_id'] + list('ABC')
)
You want to identify the unique users and randomly select some of them for the test set. Then split the dataframe so that all rows for test users go into one frame and all rows for train users go into the other.
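# Shuffle the unique users and cut the shuffled array at the 80% mark.
unique_users = df['user_id'].unique()
train_users, test_users = np.split(
    np.random.permutation(unique_users), [int(.8 * len(unique_users))]
)
# Every row of a given user lands on the same side of the split.
df_train = df[df['user_id'].isin(train_users)]
df_test = df[df['user_id'].isin(test_users)]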
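This puts about 80% of the users in the training set. The split of rows is only roughly 80/20, because users have different numbers of measurements.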
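A quick way to see how the user-level split behaves is to check it on data where subjects have unequal numbers of rows. The sketch below is illustrative only: the toy data, the seeded generator rng, and the names overlap and train_row_fraction are assumptions, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 20 subjects with 1 to 9 measurements each.
rng = np.random.default_rng(0)
counts = rng.integers(1, 10, size=20)
df = pd.DataFrame({
    'user_id': np.repeat(np.arange(20), counts),
    'A': rng.normal(size=counts.sum()),
})

# User-level split: shuffle the unique users, cut at the 80% mark.
unique_users = df['user_id'].unique()
train_users, test_users = np.split(
    rng.permutation(unique_users), [int(.8 * len(unique_users))]
)
df_train = df[df['user_id'].isin(train_users)]
df_test = df[df['user_id'].isin(test_users)]

# No subject may appear on both sides of the split.
overlap = set(df_train['user_id']) & set(df_test['user_id'])
# Fraction of rows (not users) that ended up in the training set.
train_row_fraction = len(df_train) / len(df)
print(f"shared subjects: {len(overlap)}, train row fraction: {train_row_fraction:.2f}")
```

The user fraction is exactly 16 of 20 here, while the row fraction depends on which subjects happen to fall on each side.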
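However, if you want the row counts to be as close to 80/20 as possible, add users to the training set incrementally, in shuffled order, for as long as the cumulative number of rows stays within the target.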
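unique_users = df['user_id'].unique()
# Target number of training rows (not users).
target_n = int(.8 * len(df))
shuffled_users = np.random.permutation(unique_users)
# Number of rows per user.
user_count = df['user_id'].value_counts()
# In shuffled order, True while the running row total stays within the target.
mapping = user_count.reindex(shuffled_users).cumsum() <= target_n
# Broadcast the per-user decision to every row of that user.
mask = df['user_id'].map(mapping)
df_train = df[mask]
df_test = df[~mask]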
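Because the cutoff depends on the order in which users are drawn, a single shuffle can land noticeably below the target. One possible refinement, sketched here as an assumption rather than as part of the original answer, is to repeat the shuffle several times and keep the attempt whose training row count is closest to the target. The helper grouped_split and its parameters train_frac, n_tries and seed are hypothetical names introduced for illustration.

```python
import numpy as np
import pandas as pd

def grouped_split(df, group_col, train_frac=0.8, n_tries=50, seed=0):
    """Hypothetical helper: boolean train mask that keeps each group whole."""
    rng = np.random.default_rng(seed)
    counts = df[group_col].value_counts()      # rows per group
    target_n = train_frac * len(df)            # desired number of training rows
    best_users, best_gap = None, np.inf
    for _ in range(n_tries):
        shuffled = rng.permutation(counts.index.to_numpy())
        cum = counts.reindex(shuffled).cumsum()
        # Groups whose running row total still fits within the target.
        chosen = cum[cum <= target_n].index
        gap = abs(counts.loc[chosen].sum() - target_n)
        if gap < best_gap:
            best_users, best_gap = chosen, gap
    return df[group_col].isin(best_users)

# Usage on hypothetical toy data: 30 subjects with 1 to 9 rows each.
rng = np.random.default_rng(1)
sizes = rng.integers(1, 10, size=30)
df = pd.DataFrame({'user_id': np.repeat(np.arange(30), sizes)})
mask = grouped_split(df, 'user_id')
df_train, df_test = df[mask], df[~mask]
print(f"train row fraction: {mask.mean():.3f}")
```

This is a simple random search, so it does not guarantee the best possible split, but the training set never exceeds the target and no subject is shared between the two frames.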
Upvotes: 3