Bal Krishna Jha

Reputation: 7206

Pandas Groupby : TypeError: unsupported operand type(s) for -: 'str' and 'str'

I've tried other solutions for this error on SO, but they are all related to Python's input or raw_input and didn't solve my problem.

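from io import StringIO
import pandas as pd

txt = '''series NAME VAL1 VAL2
0 AAA 27 678
1 BBB 45 744
2 CCC 34 275
3 AAA 29 932
4 CCC 47 288
5 BBB 24 971
'''
# Raw string so the regex separator is not treated as an escape sequence
df = pd.read_table(StringIO(txt), sep=r'\s+')
del df['series']
# This line raises the TypeError shown below
df = df.groupby('NAME').apply(lambda x: x.max() - x.min())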

TypeError: unsupported operand type(s) for -: 'str' and 'str'

But if I check them individually (max, min), they work. I've checked the types of columns VAL1 and VAL2, and both are int64.

Upvotes: 1

Views: 4974

Answers (1)

cs95

Reputation: 402333

This is a bug up until v0.22. From v0.23 onwards, non-numeric columns are ignored by default.

Unfortunately, groupby.apply passes each group to your lambda as a whole DataFrame, including the column you've grouped on ("NAME", which holds strings), so x.max() - x.min() ends up subtracting one string from another.
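To see the failure in isolation, here is a minimal sketch that pulls out one group by hand and repeats the subtraction on it. The data is the sample from the question rebuilt as a dict, and the names `group` and `raised` are only illustrative. It assumes the group handed to the lambda still carries the "NAME" column, which is the behaviour the question ran into.

```python
import pandas as pd

# Same sample data as in the question, built directly instead of parsed
df = pd.DataFrame({
    'NAME': ['AAA', 'BBB', 'CCC', 'AAA', 'CCC', 'BBB'],
    'VAL1': [27, 45, 34, 29, 47, 24],
    'VAL2': [678, 744, 275, 932, 288, 971],
})

# One group, selected by hand; it still contains the string column NAME
group = df[df['NAME'] == 'AAA']

# max() and min() each work on their own, as the question observed
group_max = group.max()
group_min = group.min()

# Subtracting them fails, because the NAME entry is str minus str
raised = False
try:
    group_max - group_min
except TypeError:
    raised = True
```

This matches the observation that max and min succeed individually: strings can be compared, but not subtracted.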

You can confirm by checking the difference between

df.groupby('NAME')[['VAL1', 'VAL2']].apply(lambda x: x.max() - x.min())

      VAL1  VAL2
NAME            
AAA      2   254
BBB     21   227
CCC     13    13

Versus

df.groupby('NAME')['NAME'].apply(lambda x: x.max() - x.min())
TypeError: unsupported operand type(s) for -: 'str' and 'str'

Basically, explicit is better than implicit.
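As a possible variation on the explicit selection above, the lambda can be dropped entirely by subtracting the built-in groupby reductions. This is only a sketch on the question's sample data, not something the original post used; `grouped` and `ranges` are illustrative names.

```python
import pandas as pd

# Same sample data as in the question
df = pd.DataFrame({
    'NAME': ['AAA', 'BBB', 'CCC', 'AAA', 'CCC', 'BBB'],
    'VAL1': [27, 45, 34, 29, 47, 24],
    'VAL2': [678, 744, 275, 932, 288, 971],
})

# Explicitly select the numeric columns after grouping
grouped = df.groupby('NAME')[['VAL1', 'VAL2']]

# Per-group max minus per-group min; both results are indexed by NAME,
# so the subtraction aligns row by row
ranges = grouped.max() - grouped.min()
```

Because "NAME" never enters the subtraction, the string column cannot trigger the TypeError.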

Alternatively, select all numeric columns and pass a Series as the grouper, so you don't have to list each column individually. Note that this is slower than grouping on a column that belongs to the DataFrame.

df.select_dtypes('number').groupby(df.NAME).apply(lambda x: x.max() - x.min())

      VAL1  VAL2
NAME            
AAA      2   254
BBB     21   227
CCC     13    13
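If the speed note above matters, one way to keep the grouping key inside the DataFrame while still avoiding a hand-written column list might look like the sketch below. It is an assumption-laden variation, not part of the original answer: `num_cols` and `ranges` are illustrative names, and agg is used here in place of apply.

```python
import pandas as pd

# Same sample data as in the question
df = pd.DataFrame({
    'NAME': ['AAA', 'BBB', 'CCC', 'AAA', 'CCC', 'BBB'],
    'VAL1': [27, 45, 34, 29, 47, 24],
    'VAL2': [678, 744, 275, 932, 288, 971],
})

# Collect the numeric column names automatically
num_cols = df.select_dtypes('number').columns.tolist()

# Group on the column itself, then restrict to the numeric columns;
# agg calls the lambda once per column per group, with a Series
ranges = df.groupby('NAME')[num_cols].agg(lambda s: s.max() - s.min())
```

On the sample data this should give the same table as the select_dtypes version above.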

Thanks to @JC.

Upvotes: 2
