Jaskaran Singh Puri

Reputation: 739

Python dataframe: how to group by one column and get the sum of another column

I want to create a new data frame that has two columns: one with the grouped 'Striker_Id' and the other with the sum of 'Batsman_Scored' for each 'Striker_Id'.

E.g.:

Striker_Id  Batsman_Scored
1            0
2            8 
...


I tried ball.groupby(['Striker_Id'])['Batsman_Scored'].sum(), but this is what I get:

Striker_Id
1      0000040141000010111000001000020000004001010001...
2      0000000446404106064011111011100012106110621402...
3      0000121111114060001000101001011010010001041011...
4      0114110102100100011010000000006010011001111101...
5      0140016010010040000101111100101000111410011000...
6      1100100000104141011141001004001211200001110111...

It doesn't sum; it only joins all the numbers together. What's the alternative?

Upvotes: 1

Views: 2229

Answers (1)

cs95

Reputation: 402483

For some reason, your columns were loaded as strings. While loading them from a CSV, try applying a converter -

df = pd.read_csv('file.csv', converters={'Batsman_Scored' : int})

Or,

df = pd.read_csv('file.csv', converters={'Batsman_Scored' : pd.to_numeric})
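(For context, this is why you saw those long digit strings: summing an object/string column concatenates the values rather than adding them. A quick made-up illustration -)

import pandas as pd

# Made-up data; when the scores are strings, sum() concatenates them
s = pd.Series(['0', '4', '1'])
print(s.sum())  # '041' - string concatenation, not addition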

If that doesn't work, then convert to integer after loading -

df['Batsman_Scored'] = df['Batsman_Scored'].astype(int)

Or,

df['Batsman_Scored'] = pd.to_numeric(df['Batsman_Scored'], errors='coerce')
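If you're wondering what errors='coerce' does - it turns anything that can't be parsed into NaN (made-up values below), and groupby sum skips NaN by default -

import pandas as pd

s = pd.Series(['0', '4', 'DNB'])          # 'DNB' is a made-up non-numeric entry
print(pd.to_numeric(s, errors='coerce'))  # 0.0, 4.0, NaN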

Now, performing the groupby should work -

r = df.groupby('Striker_Id')['Batsman_Scored'].sum() 
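On a tiny made-up frame, the whole thing looks like this -

import pandas as pd

df = pd.DataFrame({'Striker_Id': [1, 1, 2, 2],
                   'Batsman_Scored': ['0', '4', '6', '2']})   # loaded as strings
df['Batsman_Scored'] = pd.to_numeric(df['Batsman_Scored'], errors='coerce')
r = df.groupby('Striker_Id')['Batsman_Scored'].sum()
print(r)
# Striker_Id
# 1    4
# 2    8
# Name: Batsman_Scored, dtype: int64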

Without access to your data, I can only speculate, but it seems that somewhere your data contains non-numeric values that prevent pandas from performing the conversion, so those columns are kept as strings. It's a little difficult to pinpoint the problematic data until you actually load it in and do something like -

(~df.col.str.isdigit()).any()

That will tell you whether there are any non-numeric items. Note that this only works for integer-like strings; float columns (strings like '1.5') can't be checked this way, since isdigit is False for them too.
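To actually see the offending values, you can filter on that mask while the column is still a string -

# Run this before converting, while Batsman_Scored is still object dtype
bad = df.loc[~df['Batsman_Scored'].str.isdigit(), 'Batsman_Scored']
print(bad.unique())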

Another way of seeing which columns have corrupt data is to query the dtypes -

df.dtypes

This gives you a listing of all the columns and their datatypes. Use it to figure out which columns need parsing -

for c in df.columns[df.dtypes == object]:
    print(c)

You can then apply the methods outlined above to fix them.
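As a shortcut, select_dtypes gives you the same listing in one call -

# All columns pandas kept as object (i.e. string) dtype
print(df.select_dtypes(include=['object']).columns.tolist())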

Upvotes: 1
