Reputation: 11026
I have a CSV file where each row holds data about a particular patient, and a single patient can have multiple rows.
The file contains thousands of patient records. I want to randomly select 100 patients, get all the records associated with them, and save those records to another CSV file.
For example, the file could look like this:
patient_id Date Diagnosis Comments
001-001 23.12.2008 Normal Normal
001-001 23.12.2009 Normal Normal
001-002 08.11.2007 Normal Normal
001-003
....
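I can load the file and pick the patients like this:
import numpy as np
import pandas as pd
frame = pd.read_csv('file.csv')
# Get the unique subjects
unique_subjects = frame['patient_id'].unique()
# Use numpy to randomly select 100 distinct patients (replace=False avoids picking the same patient twice)
random_us = np.random.choice(unique_subjects, 100, replace=False)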
I could then go through the data row by row and decide which rows to write to the new CSV file.
I have a feeling pandas might provide something more direct, and I wonder if there is a way to chain all of these operations with it.
Upvotes: 0
Views: 51
Reputation: 150825
You can use isin to keep only the rows whose patient_id is among the selected IDs:
random_records = frame[frame['patient_id'].isin(random_us)]
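If it helps, here is a minimal end-to-end sketch of the same idea on toy data. The helper name `sample_patients`, its `seed` parameter, and the toy frame are made up for illustration; it uses `drop_duplicates().sample()` instead of `np.random.choice`, which is just one possible way to draw distinct IDs:

```python
import pandas as pd

def sample_patients(frame, n, id_col="patient_id", seed=None):
    """Return every row belonging to n randomly chosen distinct patients."""
    # One entry per patient, then draw n of them without replacement
    # (Series.sample defaults to replace=False).
    chosen = frame[id_col].drop_duplicates().sample(n=n, random_state=seed)
    # Keep all rows whose ID is among the chosen ones.
    return frame[frame[id_col].isin(chosen)]

# Toy data standing in for the real file (hypothetical values).
frame = pd.DataFrame({
    "patient_id": ["001-001", "001-001", "001-002", "001-003", "001-003"],
    "Diagnosis": ["Normal"] * 5,
})

subset = sample_patients(frame, n=2, seed=0)
print(subset)
```

With the real file you would presumably call it with `n=100` and then save the result with something like `subset.to_csv('sampled.csv', index=False)` (the output filename is an assumption).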
Upvotes: 1