Reputation: 16199
The following extraction from a CSV read into pandas as pd
:
return pd.Series((wb['impressions'].values * 1.0)/(wb['ad_requests'].values * 1.0), index=wb['\xef\xbb\xbf"ad_tag_name"']).to_dict()
is no longer valid, since now, as per the image below,
I need to now sum these multiple values per "key" of column A, and do likewise for column C, then plug these back into the division calculation and series creation.
Experimenting with groupby()
in isolation for the former behaves well (duplicate keys are removed, which is what I want):
In [36]: wb.groupby('\xef\xbb\xbf"ad_tag_name"').sum()['impressions']
Out[36]:
"ad_tag_name"
he.com_300x250_bottomloopmobile 26752
he.com_300x250_bottomslidemobile 31217
Yet when I add in back in index=wb['\xef\xbb\xbf"ad_tag_name"']
to try rebuild the full formula, pandas no longer drops the duplicates:
In [37]: pd.Series(wb.groupby('\xef\xbb\xbf"ad_tag_name"').sum()['impressions'], index=wb['\xef\xbb\xbf"ad_tag_name"'])
Out[37]:
"ad_tag_name"
he.com_300x250_bottomloopmobile 26752
he.com_300x250_bottomloopmobile 26752
he.com_300x250_bottomloopmobile 26752
he.com_300x250_bottomslidemobile 31217
he.com_300x250_bottomslidemobile 31217
he.com_300x250_bottomslidemobile 31217
he.com_300x250_bottomslidemobile 31217
Assuming the groupby()
component of the formula can stay as is, how do we tell the series creation to recognize the duplicate keys of the index column?
Upvotes: 1
Views: 120
Reputation: 862681
It seems you need assign output to wb
- groupby
with sum
aggregate all numeric columns, so no duplicates, last add as_index=False
for DataFrame
output:
wb = wb.groupby("ad_tag_name", as_index=False).sum()
#alternative solution
#wb = wb.groupby("ad_tag_name").sum().reset_index()
Sample:
wb = pd.DataFrame({'ad_tag_name':['he.com_300x250_bottomloopmobile'] * 3 +
['he.he.com_300x250_bottomslidemobile'] * 4,
'impressions':[309, 3029,23414,1465,5725,2918,11109],
'ad_requests':[37849,np.nan,np.nan, 42300,np.nan, np.nan, np.nan]})
#print (wb)
wb = wb.groupby('ad_tag_name', as_index=False).sum()
print (wb)
ad_tag_name ad_requests impressions
0 he.com_300x250_bottomloopmobile 37849.0 26752
1 he.he.com_300x250_bottomslidemobile 42300.0 21217
a = pd.Series((wb['impressions'].values * 1.0)/(wb['ad_requests'].values * 1.0),
index=wb['ad_tag_name']).to_dict()
print (a)
{'he.he.com_300x250_bottomslidemobile': 0.50158392434988175,
'he.com_300x250_bottomloopmobile': 0.70680863431002139}
Also for remove \xef\xbb\xbf
add encoding='utf-8-sig'
to read_csv
or upgrade pandas to last version, because this bug is closed and solved.
Upvotes: 1