RobertoST
RobertoST

Reputation: 61

Get basic statistics from subgroups in Python

I am using Python 3.6 and I'm trying to get statistics from subgroups of a dataset. For example, the main grouping factors are Uni and year. From there, I want to have some basic stats from other subgroups, for instance, the number of females or number or students taking a science course and their proportions.

The risk of doing straight forward counts is that there may be some double counting. I have already solved the double counting problem, but the code seems too long and the first grouping takes really long when considering there are thousands of students and many more universities and years. I hope there is some other answer out there that is more efficient.

df1 = pd.DataFrame([('USC', 2009, 'A', 'X', 'Science', 'F', 50),
               ('USC', 2009, 'A', 'Y', 'Science', 'F', 50),
               ('USC', 2009, 'A', 'Z', 'Arts', 'F', 500),
               ('USC', 2009, 'A', 'W', 'Arts', 'F', 50),
               ('USC', 2009, 'B', 'W', 'Arts', 'M', 500),
               ('USC', 2009, 'B', 'Z', 'Arts', 'M', 50),
               ('USC', 2009, 'C', 'X', 'Science', 'F', 50),
               ('USC', 2009, 'C', 'Y', 'Science', 'F', 500),
               ('USC', 2009, 'C', 'W', 'Arts', 'F', 50),
               ('USC', 2010, 'D', 'X', 'Science', 'M', 50),
               ('USC', 2010, 'D', 'Y', 'Science', 'M', 500),
               ('USC', 2010, 'D', 'W', 'Arts', 'M', 50),
               ('USC', 2010, 'E', 'X', 'Science', 'M', 50),
               ('USC', 2010, 'E', 'Y', 'Science', 'M', 500),
               ('USC', 2010, 'E', 'W', 'Arts', 'M', 50)],
               columns=('Uni', 'year', 'student','course','faculty','gender', 'fee'))

the convoluted code used to compile the final data is:

# first grouping - eliminating duplicities
data_tmp = df1.groupby(['Uni', 'year','student'])
data_gds = data_tmp.agg({'fee': 'sum'})
data_prc = (data_gds
 .join(data_tmp['gender'].apply(lambda x: 1 if (x[x == 'F'].count()>0) else 0))
 .join(data_tmp['faculty'].apply(lambda x: 1 if (x[x == 'Science'].count()>0) else 0 ))
 .reset_index()
)
# second grouping - eliminating students
data_tmp = data_prc.groupby(['Uni', 'year'])
data_gds = data_tmp['student'].apply(lambda x: x.unique().shape[0]).to_frame('Num_student')
data_prc = (data_gds
 .join(data_tmp.agg({'fee': 'sum'}))
 .join(data_tmp.agg({'gender': 'sum'}).rename(columns={'gender': 'gender_female'}) ) 
 .join(data_tmp.agg({'faculty':'sum'}).rename(columns={'faculty': 'faculty_Science'}))              
 .reset_index()
)
# adding percetages here
data_prc['Prc_Female']  = data_prc['gender_female']/data_prc['Num_student']
data_prc['Prc_Science'] = data_prc['faculty_Science']/data_prc['Num_student']

Also, I do not really need to use sum of fees, but seems the use of aggregation allows me to use join. Seems that the joining statements are the ones that take longer time, I hope there is a way to get avoid their use (or make it better).

The answer from above looks like this: enter image description here

Upvotes: 0

Views: 80

Answers (1)

gold_cy
gold_cy

Reputation: 14216

We can greatly simplify your current code, and there is no need for lambdas as well.

def make_stats(df):
    base = df.groupby(['Uni', 'year'], as_index=False) \
             .agg({'student': pd.Series.nunique, 'fee': sum}) \
             .rename(columns={'student': 'num_student'})

    females = df[df.gender == 'F'].groupby(['Uni', 'year'], as_index=False) \
                                  .agg({'student': pd.Series.nunique}) \
                                  .rename(columns={'student': 'gender_female'})

    science = df[df.faculty == 'Science'].groupby(['Uni', 'year'], as_index=False) \
                                         .agg({'course': pd.Series.nunique}) \
                                         .rename(columns={'course': 'faculty_science'})

    kwargs = {'how': 'left', 
              'left_on': ['Uni', 'year'], 
              'right_on': ['Uni', 'year']}

    step_1 = pd.merge(base, females, **kwargs)
    step_2 = pd.merge(step_1, science, **kwargs).fillna(0)

    step_2['prc_female'] = step_2['gender_female'] / step_2['num_student']
    step_2['prc_science'] = step_2['faculty_science'] / step_2['num_student']

    return step_2

output = make_stats(df1)

print(output)

   Uni  year  num_student   fee  gender_female  faculty_science  prc_female  prc_science
0  USC  2009            3  1800            2.0                2    0.666667     0.666667
1  USC  2010            2  1200            0.0                2    0.000000     1.000000

Upvotes: 1

Related Questions