Reputation: 61
I am using Python 3.6 and I'm trying to get statistics from subgroups of a dataset. The main grouping factors are Uni and year. From there, I want some basic stats for other subgroups, for instance, the number of female students, the number of students taking a science course, and their proportions.
The risk of doing straightforward counts is double counting, since a student can appear in several rows. I have already solved the double-counting problem, but the code seems too long, and the first grouping is slow considering there are thousands of students and many more universities and years. I hope there is a more efficient approach.
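To make the double-counting risk concrete, a minimal made-up sketch (a two-row toy dataset, not my real data) of how a naive row count inflates the number of science students:

```python
import pandas as pd

# Hypothetical two-row dataset: one student enrolled in two Science courses.
rows = pd.DataFrame({'student': ['A', 'A'],
                     'faculty': ['Science', 'Science']})

naive = int((rows['faculty'] == 'Science').sum())                    # counts rows
dedup = rows.loc[rows['faculty'] == 'Science', 'student'].nunique()  # counts students
print(naive, dedup)  # naive says 2 science students, deduplicated says 1
```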
import pandas as pd

df1 = pd.DataFrame([('USC', 2009, 'A', 'X', 'Science', 'F', 50),
                    ('USC', 2009, 'A', 'Y', 'Science', 'F', 50),
                    ('USC', 2009, 'A', 'Z', 'Arts', 'F', 500),
                    ('USC', 2009, 'A', 'W', 'Arts', 'F', 50),
                    ('USC', 2009, 'B', 'W', 'Arts', 'M', 500),
                    ('USC', 2009, 'B', 'Z', 'Arts', 'M', 50),
                    ('USC', 2009, 'C', 'X', 'Science', 'F', 50),
                    ('USC', 2009, 'C', 'Y', 'Science', 'F', 500),
                    ('USC', 2009, 'C', 'W', 'Arts', 'F', 50),
                    ('USC', 2010, 'D', 'X', 'Science', 'M', 50),
                    ('USC', 2010, 'D', 'Y', 'Science', 'M', 500),
                    ('USC', 2010, 'D', 'W', 'Arts', 'M', 50),
                    ('USC', 2010, 'E', 'X', 'Science', 'M', 50),
                    ('USC', 2010, 'E', 'Y', 'Science', 'M', 500),
                    ('USC', 2010, 'E', 'W', 'Arts', 'M', 50)],
                   columns=('Uni', 'year', 'student', 'course', 'faculty', 'gender', 'fee'))
The (admittedly convoluted) code used to compile the final data is:
# first grouping - eliminating duplicates
data_tmp = df1.groupby(['Uni', 'year', 'student'])
data_gds = data_tmp.agg({'fee': 'sum'})
data_prc = (data_gds
            .join(data_tmp['gender'].apply(lambda x: 1 if (x[x == 'F'].count() > 0) else 0))
            .join(data_tmp['faculty'].apply(lambda x: 1 if (x[x == 'Science'].count() > 0) else 0))
            .reset_index()
            )

# second grouping - eliminating students
data_tmp = data_prc.groupby(['Uni', 'year'])
data_gds = data_tmp['student'].apply(lambda x: x.unique().shape[0]).to_frame('Num_student')
data_prc = (data_gds
            .join(data_tmp.agg({'fee': 'sum'}))
            .join(data_tmp.agg({'gender': 'sum'}).rename(columns={'gender': 'gender_female'}))
            .join(data_tmp.agg({'faculty': 'sum'}).rename(columns={'faculty': 'faculty_Science'}))
            .reset_index()
            )

# adding percentages here
data_prc['Prc_Female'] = data_prc['gender_female'] / data_prc['Num_student']
data_prc['Prc_Science'] = data_prc['faculty_Science'] / data_prc['Num_student']
Also, I do not really need the sum of fees; using an aggregation just seems to be what lets me use join. The join statements appear to be the ones that take the longest, so I hope there is a way to avoid them (or make them faster).
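For reference, the joins can be avoided entirely by flagging each row first and then aggregating in two passes. A minimal sketch, assuming pandas 0.25+ for named aggregation (the flag column names `is_female`/`is_science` are my own, not from the original code):

```python
import pandas as pd

df1 = pd.DataFrame([('USC', 2009, 'A', 'X', 'Science', 'F', 50),
                    ('USC', 2009, 'A', 'Y', 'Science', 'F', 50),
                    ('USC', 2009, 'A', 'Z', 'Arts', 'F', 500),
                    ('USC', 2009, 'A', 'W', 'Arts', 'F', 50),
                    ('USC', 2009, 'B', 'W', 'Arts', 'M', 500),
                    ('USC', 2009, 'B', 'Z', 'Arts', 'M', 50),
                    ('USC', 2009, 'C', 'X', 'Science', 'F', 50),
                    ('USC', 2009, 'C', 'Y', 'Science', 'F', 500),
                    ('USC', 2009, 'C', 'W', 'Arts', 'F', 50),
                    ('USC', 2010, 'D', 'X', 'Science', 'M', 50),
                    ('USC', 2010, 'D', 'Y', 'Science', 'M', 500),
                    ('USC', 2010, 'D', 'W', 'Arts', 'M', 50),
                    ('USC', 2010, 'E', 'X', 'Science', 'M', 50),
                    ('USC', 2010, 'E', 'Y', 'Science', 'M', 500),
                    ('USC', 2010, 'E', 'W', 'Arts', 'M', 50)],
                   columns=('Uni', 'year', 'student', 'course', 'faculty', 'gender', 'fee'))

# Pass 1: collapse rows to one record per student; 'any' flags a student as
# female/science if any of their rows match, which prevents double counting.
per_student = (df1.assign(is_female=df1['gender'].eq('F'),
                          is_science=df1['faculty'].eq('Science'))
                  .groupby(['Uni', 'year', 'student'])
                  .agg(fee=('fee', 'sum'),
                       is_female=('is_female', 'any'),
                       is_science=('is_science', 'any')))

# Pass 2: collapse students to one record per (Uni, year); summing the boolean
# flags counts the students flagged in pass 1, so no joins are needed.
stats = (per_student.groupby(['Uni', 'year'])
                    .agg(num_student=('fee', 'size'),
                         fee=('fee', 'sum'),
                         gender_female=('is_female', 'sum'),
                         faculty_science=('is_science', 'sum'))
                    .reset_index())

stats['prc_female'] = stats['gender_female'] / stats['num_student']
stats['prc_science'] = stats['faculty_science'] / stats['num_student']
print(stats)
```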
The output from the code above looks like this:
Upvotes: 0
Views: 80
Reputation: 14216
We can greatly simplify your current code, and there is no need for lambdas either.
def make_stats(df):
    base = (df.groupby(['Uni', 'year'], as_index=False)
              .agg({'student': pd.Series.nunique, 'fee': 'sum'})
              .rename(columns={'student': 'num_student'}))

    # Distinct female students per (Uni, year) - nunique avoids double counting.
    females = (df[df.gender == 'F']
               .groupby(['Uni', 'year'], as_index=False)
               .agg({'student': pd.Series.nunique})
               .rename(columns={'student': 'gender_female'}))

    # Distinct students with at least one Science row - count students, not courses.
    science = (df[df.faculty == 'Science']
               .groupby(['Uni', 'year'], as_index=False)
               .agg({'student': pd.Series.nunique})
               .rename(columns={'student': 'faculty_science'}))

    kwargs = {'how': 'left', 'on': ['Uni', 'year']}
    step_1 = pd.merge(base, females, **kwargs)
    step_2 = pd.merge(step_1, science, **kwargs).fillna(0)

    step_2['prc_female'] = step_2['gender_female'] / step_2['num_student']
    step_2['prc_science'] = step_2['faculty_science'] / step_2['num_student']
    return step_2

output = make_stats(df1)
print(output)
Uni year num_student fee gender_female faculty_science prc_female prc_science
0 USC 2009 3 1800 2.0 2 0.666667 0.666667
1 USC 2010 2 1200 0.0 2 0.000000 1.000000
Upvotes: 1