Solving Py
Solving Py

Reputation: 25

Normalize values according to category in pandas

My dataframe looks like this

#    pubdate  lang Count
# 0  20140619 en   3
# 1  20150308 en   1
# 2  20140207 en   1
# 3  20180319 en   1
# 4  20150223 en   1

I would like to normalize the count by replacing it with itself divided by the average count/year.

For example, here is how the table looks at the end

#    pubdate  lang Count
# 0  20140619 en   1.5
# 1  20150308 en   1
# 2  20140207 en   0.5
# 3  20180319 en   1
# 4  20150223 en   1

So for year 2014, the average of all counts was (3+1)/(2 rows of 2014) = 2, then each value was divided by it.

I thought about first duplicating the frame, having a year column, then grouping per year, then changing values in first table according to that.

I do not know how to do it in code

Upvotes: 0

Views: 612

Answers (2)

ipj
ipj

Reputation: 3598

You can also calculate step by step keeping all temporary calculations. First convert pubdate to datetime type and extract year:

import datetime 
df['pubdate_year'] = pd.to_datetime(df.pubdate, format='%Y%m%d').dt.to_period('Y')

then group by year and calculate mean of Count:

df['year_mean'] = df.groupby(['pubdate_year']).Count.transform('mean')

finally just divide columns:

df['normalized_count'] = df['Count'] / df['year_mean'] 

result contains all steps of calculations:

    pubdate lang  Count pubdate_year  year_mean  normalized_count
0  20140619   en      3         2014          2               1.5
1  20150308   en      1         2015          1               1.0
2  20140207   en      1         2014          2               0.5
3  20180319   en      1         2018          1               1.0
4  20150223   en      1         2015          1               1.0

If You don't need to keep temporary calculations:

df = df.drop(columns=['Count','pubdate_year','year_mean']).rename(columns={'normalized_count':'Count'})

    pubdate lang  Count
0  20140619   en    1.5
1  20150308   en    1.0
2  20140207   en    0.5
3  20180319   en    1.0
4  20150223   en    1.0

Upvotes: 2

yatu
yatu

Reputation: 88236

Here's one slicing on the year substring and using it as a grouper, then divide the Count by the mean with transform to preserve the shape of the dataframe:

df['Count'] /= df.groupby(df.pubdate.astype(str).str[:4]).Count.transform('mean')

print(df)

    pubdate lang  Count
0  20140619   en    1.5
1  20150308   en    1.0
2  20140207   en    0.5
3  20180319   en    1.0
4  20150223   en    1.0

Upvotes: 1

Related Questions