jacob_pelletier

Reputation: 21

filtering data from Pandas dataframes

Background: I am trying to use data from a CSV file to ask questions and draw conclusions based on the data. The data is a log of patient visits to a clinic in Brazil, including additional patient data and whether the patient was a no-show. I have chosen to examine correlations between the patient's age and the no-show data.

Problem: Given visit number, patient ID, age, and no-show data, how do I compile an array of ages, one for each unique patient ID, so that I can evaluate the mean age of the unique patients visiting the clinic?

My code:

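import pandas as pd

# data set of no shows at a clinic in Brazil
noshow_data = pd.read_csv('noshowappointments-kagglev2-may-2016.csv')

# read_csv already returns a DataFrame, so this wrapping is redundant but harmless
noshow_df = pd.DataFrame(noshow_data)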

Here is the beginning of the code, with the head of the whole dataframe of the csv given

# Next I construct a dataframe with only the data I'm interested in:

ptid = noshow_df['PatientId']
ages = noshow_df['Age']
noshow = noshow_df['No-show']
ptid_ages_noshow = pd.DataFrame({'PatientId' : ptid, 'Ages' : ages,
                                 'No_show' : noshow})

ptid_ages_noshow

Here I have sorted the data to show the multiple visits of a unique patient

# Now, I know how to determine the total number of unique patients:

# total number of unique patients
num_unique_pts = noshow_df.PatientId.unique()
len(num_unique_pts)

To find the mean age across all visits (where a patient with several visits is counted several times), I would use:

# mean age of all visits
ages = noshow_data['Age']
ages.mean()

So my question is this: how can I find the mean age of the unique patients, counting each patient only once?

Upvotes: 2

Views: 316

Answers (2)

Joooeey

Reputation: 3867

So you only want to keep one appointment per patient for the calculation? This is how to do it:

noshow_df.drop_duplicates('PatientId')['Age'].mean()

Keep in mind that a patient's age can change between visits. By default, drop_duplicates keeps the first row it finds for each patient, so you need to decide which visit's age you want to use.
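One way to see how much this matters is to count the patients with more than one distinct age on record, and to compare keeping the first row against keeping the last. This is only a minimal sketch: the small frame below is invented for illustration, and only the column names are taken from the question.

```python
import pandas as pd

# Toy data, invented for illustration; column names follow the question.
visits = pd.DataFrame({
    "PatientId": [1, 1, 2, 3, 3, 3],
    "Age":       [30, 31, 50, 10, 10, 10],
})

# Number of distinct ages recorded for each patient.
ages_per_patient = visits.groupby("PatientId")["Age"].nunique()

# Patients whose recorded age differs between visits.
changed = ages_per_patient[ages_per_patient > 1]
print(len(changed))

# drop_duplicates keeps the first row per patient by default;
# keep="last" keeps the last row instead.
mean_first = visits.drop_duplicates("PatientId")["Age"].mean()
mean_last = visits.drop_duplicates("PatientId", keep="last")["Age"].mean()
print(mean_first, mean_last)
```

Note that "first" and "last" refer to row order, so if you want the earliest or latest appointment you would probably need to sort by the appointment date first.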

Upvotes: 0

Boubacar Traoré

Reputation: 359

You can use the groupby function available in pandas, restricted to the relevant columns, to get the mean age of each patient:

ptid_ages_noshow[['PatientId','Ages']].groupby('PatientId').mean()
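The groupby line returns one mean age per patient rather than a single number. To get the mean over unique patients, one option is to average those per-patient values. Here is a sketch on invented data (the values are made up; only the column names come from the question):

```python
import pandas as pd

# Toy data, invented for illustration; column names follow the question.
ptid_ages_noshow = pd.DataFrame({
    "PatientId": [1, 1, 2, 3, 3, 3],
    "Ages":      [30, 31, 50, 10, 10, 10],
})

# One value per patient: that patient's mean age across their visits.
per_patient = ptid_ages_noshow.groupby("PatientId")["Ages"].mean()

# Mean over unique patients: each patient counts exactly once.
mean_unique = per_patient.mean()

# For comparison: mean over all visits, where frequent visitors weigh more.
mean_visits = ptid_ages_noshow["Ages"].mean()

print(mean_unique, mean_visits)
```

Averaging each patient's ages first sidesteps the question of which visit to keep, at the cost of producing fractional ages for patients whose age changed between visits.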

Upvotes: 1
