Reputation: 3225
I initially wrote some code in python 2.7, but now I switched to python 3.5. I want to aggregate numeric data from a couple of columns and grouping by the rest of them or at least one.
Here's my initial dataframe "testdf":
testdf
PROD_TAG BRAND Market ('VAL', 'Per1') ('VAL', 'Per2')
P_1 A Modern Trade 4.3 0.155
P_2 A Traditional Trade 5.7 0
P_3 B Modern Trade 10.0 11.2
P_3 B Traditional Trade 8.7 6.3
P_4 C Modern Trade 12.1 12.3
P_5 D Modern Trade 8.0 7.0
Last two column headers are tuples (thanks captain obvious). Per1 and Per2 stands for respective periods.
I want to execute a line of code, which worked previously on python 2.7:
testdf=testdf.groupby(['BRAND'])[('VAL','P1'),('VAL','P2')].sum()
It doesn't work because of tuple type of column headers and raises:
TypeError: unorderable types: tuple() < str()
Now if I rename column headers like this:
testdf.columns=['PROD_TAG', 'BRAND', 'Market', 'VAL-P1', 'VAL-P2']
(removing tuples) I'll be able to execute the same line of code with new columns names:
testdf1=testdf.groupby(['BRAND'])['VAL-P1','VAL-P2'].sum()
and get in the end:
BRAND ('VAL', 'Per1') ('VAL', 'Per2')
A 10.0 0.155
B 18.7 17.5
C 12.1 12.3
D 8.0 7.0
Most wierd thing here is that if I use .mean()
instead of .sum(), min() or .max()
, everything works fine even with tuples.
Can anybody explain me how can I make such aggregations with tuple columns names work on python 3.5?
Upvotes: 3
Views: 1535
Reputation: 29711
I think you need to use groupby.agg
and pass a function to aggregate the sum of each group as shown:
df = pd.DataFrame({'PROD_TAG':["P_1", "P_2", "P_3", "P_3", "P_4", "P_5"],
'BRAND':["A", "A", "B", "B", "C", "D"],
'Market':["Modern Trade", "Traditional Trade", \
"Modern Trade", "Traditional Trade", "Modern Trade", "Modern Trade"],
('VAL','Per1'):[4.3, 5.7, 10.0, 8.7, 12.1, 8.0],
('VAL','Per2'):[0.155, 0, 11.2, 6.3, 12.3, 7.0]})
type(df[('VAL','Per1')].name)
#<class 'tuple'>
df.groupby(['BRAND'])[('VAL','Per1'), ('VAL','Per2')].agg(lambda x: x.sum())
(VAL, Per1) (VAL, Per2)
BRAND
A 10.0 0.155
B 18.7 17.500
C 12.1 12.300
D 8.0 7.000
Alternatively, the index is not reset and the grouper columns are transformed. So, you could get rid of the TypeError
due to the name mismatch of the columns[tuple/str
].
df.groupby(['BRAND'], as_index=False)[('VAL','Per1'), ('VAL','Per2')].sum()
BRAND (VAL, Per1) (VAL, Per2)
0 A 10.0 0.155
1 B 18.7 17.500
2 C 12.1 12.300
3 D 8.0 7.000
But if you rename
the tuple
columns to string
, you can continue as before without using agg
functions:
df.rename(index=str, columns={('VAL','Per1'): "('VAL','Per1')", \
('VAL','Per2'): "('VAL','Per2')"}, inplace=True)
type(df["('VAL','Per1')"].name)
#<class 'str'>
df.groupby(['BRAND'])["('VAL','Per1')","('VAL','Per2')"].sum()
('VAL','Per1') ('VAL','Per2')
BRAND
A 10.0 0.155
B 18.7 17.500
C 12.1 12.300
D 8.0 7.000
Note: Tested in Python 3.5
Upvotes: 2