Reputation: 25
My dataframe looks like this
# pubdate lang Count
# 0 20140619 en 3
# 1 20150308 en 1
# 2 20140207 en 1
# 3 20180319 en 1
# 4 20150223 en 1
I would like to normalize the count by replacing it with itself divided by the average count/year.
For example, here is how the table looks at the end
# pubdate lang Count
# 0 20140619 en 1.5
# 1 20150308 en 1
# 2 20140207 en 0.5
# 3 20180319 en 1
# 4 20150223 en 1
So for year 2014, the average of all counts was (3+1)/(2 rows of 2014) = 2, then each value was divided by it.
I thought about first duplicating the frame, having a year column, then grouping per year, then changing values in first table according to that.
I do not know how to do it in code
Upvotes: 0
Views: 612
Reputation: 3598
You can also calculate step by step keeping all temporary calculations.
First convert pubdate
to datetime type and extract year:
import datetime
df['pubdate_year'] = pd.to_datetime(df.pubdate, format='%Y%m%d').dt.to_period('Y')
then group by year and calculate mean of Count
:
df['year_mean'] = df.groupby(['pubdate_year']).Count.transform('mean')
finally just divide columns:
df['normalized_count'] = df['Count'] / df['year_mean']
result contains all steps of calculations:
pubdate lang Count pubdate_year year_mean normalized_count
0 20140619 en 3 2014 2 1.5
1 20150308 en 1 2015 1 1.0
2 20140207 en 1 2014 2 0.5
3 20180319 en 1 2018 1 1.0
4 20150223 en 1 2015 1 1.0
If You don't need to keep temporary calculations:
df = df.drop(columns=['Count','pubdate_year','year_mean']).rename(columns={'normalized_count':'Count'})
pubdate lang Count
0 20140619 en 1.5
1 20150308 en 1.0
2 20140207 en 0.5
3 20180319 en 1.0
4 20150223 en 1.0
Upvotes: 2
Reputation: 88236
Here's one slicing on the year substring and using it as a grouper, then divide the Count
by the mean
with transform
to preserve the shape of the dataframe:
df['Count'] /= df.groupby(df.pubdate.astype(str).str[:4]).Count.transform('mean')
print(df)
pubdate lang Count
0 20140619 en 1.5
1 20150308 en 1.0
2 20140207 en 0.5
3 20180319 en 1.0
4 20150223 en 1.0
Upvotes: 1