Pyderman

Reputation: 16199

Pandas Series from DataFrame columns when the index column contains duplicates

The following extraction from a CSV read into pandas (imported as pd):

return pd.Series((wb['impressions'].values * 1.0)/(wb['ad_requests'].values * 1.0), index=wb['\xef\xbb\xbf"ad_tag_name"']).to_dict()

is no longer valid, since now, as per the image below,

  1. there are multiple entries of the same name in column A (3 of he.com_300x250_bottomloopmobile and 4 of he.com_300x250_bottomslidemobile, for example), and
  2. column C will always be blank for all but the first entry for each unique value.

[screenshot: worksheet with duplicate ad_tag_name values in column A, and column C blank for all but the first row of each group]

I now need to sum these multiple values per "key" of column A, do likewise for column C, and then plug those sums back into the division calculation and series creation.

Experimenting with groupby() in isolation handles the former well (duplicate keys are collapsed, which is what I want):

In [36]: wb.groupby('\xef\xbb\xbf"ad_tag_name"').sum()['impressions']
Out[36]: 
"ad_tag_name"
he.com_300x250_bottomloopmobile                              26752
he.com_300x250_bottomslidemobile                             31217

Yet when I add back in index=wb['\xef\xbb\xbf"ad_tag_name"'] to try to rebuild the full formula, pandas no longer drops the duplicates:

In [37]: pd.Series(wb.groupby('\xef\xbb\xbf"ad_tag_name"').sum()['impressions'], index=wb['\xef\xbb\xbf"ad_tag_name"'])
Out[37]: 
"ad_tag_name"
he.com_300x250_bottomloopmobile          26752
he.com_300x250_bottomloopmobile          26752
he.com_300x250_bottomloopmobile          26752
he.com_300x250_bottomslidemobile         31217
he.com_300x250_bottomslidemobile         31217
he.com_300x250_bottomslidemobile         31217
he.com_300x250_bottomslidemobile         31217

Assuming the groupby() component of the formula can stay as is, how do we tell the series creation to recognize the duplicate keys of the index column?

Upvotes: 1

Views: 120

Answers (1)

jezrael

Reputation: 862681

It seems you need to assign the output back to wb. groupby with sum aggregates all numeric columns, so there are no duplicates; finally, add as_index=False to get a DataFrame as output:

wb = wb.groupby("ad_tag_name", as_index=False).sum()
#alternative solution
#wb = wb.groupby("ad_tag_name").sum().reset_index()
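Both spellings produce the same DataFrame; a quick check on a toy frame (column names are placeholders):

```python
import pandas as pd

df = pd.DataFrame({'ad_tag_name': ['a', 'a', 'b'],
                   'impressions': [1, 2, 3]})

# as_index=False keeps the group key as an ordinary column...
out1 = df.groupby('ad_tag_name', as_index=False).sum()
# ...which is equivalent to grouping normally and resetting the index.
out2 = df.groupby('ad_tag_name').sum().reset_index()
print(out1.equals(out2))  # True
```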

Sample:

import numpy as np
import pandas as pd

wb = pd.DataFrame({'ad_tag_name':['he.com_300x250_bottomloopmobile'] * 3 +
                                 ['he.he.com_300x250_bottomslidemobile'] * 4, 
                   'impressions':[309, 3029, 23414, 1465, 5725, 2918, 11109],
                   'ad_requests':[37849, np.nan, np.nan, 42300, np.nan, np.nan, np.nan]})

#print (wb)    

wb = wb.groupby('ad_tag_name', as_index=False).sum()
print (wb)
                           ad_tag_name  ad_requests  impressions
0      he.com_300x250_bottomloopmobile      37849.0        26752
1  he.he.com_300x250_bottomslidemobile      42300.0        21217

a = pd.Series((wb['impressions'].values * 1.0)/(wb['ad_requests'].values * 1.0), 
           index=wb['ad_tag_name']).to_dict()

print (a)
{'he.he.com_300x250_bottomslidemobile': 0.50158392434988175, 
'he.com_300x250_bottomloopmobile': 0.70680863431002139}

Also, to remove the \xef\xbb\xbf prefix (a UTF-8 BOM), add encoding='utf-8-sig' to read_csv, or upgrade pandas to the latest version, because this bug has been closed and fixed.
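A quick sketch of the encoding fix, using a hypothetical in-memory CSV in place of the real file:

```python
import io
import pandas as pd

# Hypothetical CSV bytes prefixed with a UTF-8 BOM (b'\xef\xbb\xbf'),
# standing in for the file from the question.
raw = b'\xef\xbb\xbfad_tag_name,impressions\nfoo,10\nbar,20\n'

# encoding='utf-8-sig' strips the BOM, so the first column name is clean
# and can be referenced as plain 'ad_tag_name'.
clean = pd.read_csv(io.BytesIO(raw), encoding='utf-8-sig')
print(clean.columns[0])  # 'ad_tag_name'
```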

Upvotes: 1
