Reputation: 67
I am trying to group a dataframe with 4000+ rows by the column 'From Email Address' and sum the 'Gross' column.
sum_df = df.groupby(['From Email Address'], as_index=False).agg(
    {'Name': 'first', 'From Email Address': 'first',
     'Country': 'first', 'Subject': 'first', 'Gross': 'sum'})  # grouped dataframe
One 'From Email Address' value, '[email protected]', has 200+ entries (rows) in the file. The value returned for that particular row is peculiar:
-10-10-10-49-10-8-30-5.8-79-90-10-10-49-10-468-7-90-10-79-5-10-6.66-1,500.00-49-10-10-8-49-4.99-10-10-10-2-4.74-49-49-90-7.5-35-10-90-7.5-7.5-17.15-10-9-42.5-10-10-468-49-33.2-49-10-15.8-10-3.22-49-5-140.4-6.13-10-7.75-10-51.35-10-49-90-3.22-90-10-5-2.9-6-79-74-6.66-49-468-10-10-49-8.38-10-2.6-149-250-10-31.6-10-49-7-10-10-5-36.5-66.2-49-3-35.7-10-10-10-5-49-5-10-49-10-10-79-406.38-29.5-10-49-49-90-10-10-10-49-50-4.7-25.48-10-3.22-10-10-10-3.22-1.6-49-49-49-49-49
whereas all the other rows output correct values. Can anyone please give some insight into how to solve this? Thanks!
I tried:
sum_df2 = df.groupby('From Email Address', as_index=False)['Gross'].sum()
The same value is returned for this too.
Upvotes: 1
Views: 641
Reputation: 2348
It appears that the numbers are being stored as strings; for example, there are fragments like 1,500.00 in your output. Because the + operator acts on strings as concatenation, the "sum" is concatenating them instead of adding them.
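As a quick illustration (using made-up values resembling those in your output, not your actual data), summing a string-dtype Series just glues the strings together:

import pandas as pd

s = pd.Series(['10', '49', '1,500.00'])   # object (string) dtype
print(s.dtype)   # object
print(s.sum())   # '10491,500.00' -- concatenation, not addition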
The solution is to make sure you first convert your data to numeric types.
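A minimal sketch of that conversion, assuming the column names from your question and that 'Gross' uses a comma as the thousands separator:

# Strip thousands separators, coerce to numbers, then group and sum.
df['Gross'] = pd.to_numeric(
    df['Gross'].astype(str).str.replace(',', '', regex=False),
    errors='coerce')   # anything non-numeric becomes NaN

sum_df = df.groupby('From Email Address', as_index=False)['Gross'].sum()

If the data is loaded from a CSV, passing thousands=',' to pd.read_csv should also get the column parsed as numeric at load time.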
Upvotes: 1