Reputation: 181
I am trying to pass a criteria to the aggregate function to pandas pivot_table and I am not able to figure out how to pass the criteria to the aggfunc. I have a data table which is converted to pandas df.
The input table data:
col1 | col2 | col3 | col4 | col5 | col6 | col7 |
---|---|---|---|---|---|---|
1 | test1 | t1 | Dummy1 | result1 | 10 | 102.2 |
2 | test1 | t1 | Dummy2 | result2 | 20 | 101.2 |
3 | test1 | t1 | Dummy3 | result3 | 30 | 102.3 |
4 | test1 | t1 | Dummy4 | result4 | 40 | 101.4 |
5 | test2 | t1 | Dummy1 | result1 | 10 | 100 |
6 | test2 | t1 | Dummy2 | result2 | 20 | 103 |
7 | test2 | t1 | Dummy3 | result3 | 30 | 104 |
8 | test2 | t1 | Dummy4 | result4 | 40 | 105 |
9 | test3 | t1 | Dummy1 | result1 | 10 | 102 |
10 | test3 | t1 | Dummy2 | result2 | 20 | 87 |
11 | test3 | t1 | Dummy3 | result3 | 30 | 107 |
12 | test3 | t1 | Dummy5 | result4 | 50 | 110.2 |
13 | test4 | t1 | Dummy2 | result2 | 20 | 120 |
14 | test5 | t1 | Dummy6 | result1 | 100 | 88 |
15 | test1 | t1 | Dummy1 | result2 | 10 | 106.2 |
16 | test1 | t1 | Dummy1 | result6 | 10 | 101.1 |
I want to get the maximum on col7 data, but only when the maximum is greater than 100. If any of the col7 data is greater than the user defined criteria, then all the other columns data needs to be populated irrespective if the data met the criteria or not.
Tried the following:
columns = ['col1', 'col2', 'col3', 'col4', 'col5', 'col6', 'col7']
df = pd.DataFrame({
'col1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16],
'col2': ['test1', 'test1', 'test1', 'test1', 'test2', 'test2', 'test2',
'test2', 'test3', 'test3', 'test3', 'test3', 'test4', 'test5',
'test1', 'test1'],
'col3': ['t1', 't1', 't1', 't1', 't1', 't1', 't1', 't1', 't1', 't1', 't1',
't1', 't1', 't1', 't1', 't1'],
'col4': ['Dummy1', 'Dummy2', 'Dummy3', 'Dummy4', 'Dummy1', 'Dummy2',
'Dummy3', 'Dummy4', 'Dummy1', 'Dummy2', 'Dummy3', 'Dummy5',
'Dummy2', 'Dummy6', 'Dummy1', 'Dummy1'],
'col5': ['result1', 'result2', 'result3', 'result4', 'result1', 'result2',
'result3', 'result4', 'result1', 'result2', 'result3', 'result4',
'result2', 'result1', 'result2', 'result6'],
'col6': [10, 20, 30, 40, 10, 20, 30, 40, 10, 20, 30, 50, 20, 100, 10, 10],
'col7': [100.2, 101.2, 102.3, 101.4, 100.0, 103.0, 104.0, 105.0, 102.0,
87.0, 107.0, 110.2, 120.0, 88.0, 106.2, 101.1]
})
res=df.pivot_table(values = 'col7', index = ['col4', 'col5', 'col6'], columns = ['col2'], fill_value = '', aggfunc = 'max' >= 100)
TypeError: '>=' not supported between instances of 'str' and 'int'
Output should look like:
Max pivoted output without col5:
col4 | col6 | test1 | test2 | test3 | test4 | test5 |
---|---|---|---|---|---|---|
Dummy1 | 10 | 106.2 | 100 | 102 | N/A | N/A |
Dummy2 | 20 | 101.2 | 103 | 87 | 120 | N/A |
Dummy3 | 30 | 102.3 | 104 | 107 | N/A | N/A |
Dummy4 | 40 | 101.4 | 105 | N/A | N/A | N/A |
Dummy5 | 50 | N/A | N/A | 110.2 | N/A | N/A |
Max pivoted output including col5:
col4 | col5 | col6 | test1 | test2 | test3 | test4 | test5 |
---|---|---|---|---|---|---|---|
Dummy1 | result2 | 10 | 106.2 | N/A | N/A | N/A | N/A |
Dummy1 | result1 | 10 | 102.2 | 100 | 102 | N/A | N/A |
Dummy2 | result2 | 20 | 101.2 | 103 | 87 | 120 | N/A |
Dummy3 | result3 | 30 | 102.3 | 104 | 107 | N/A | N/A |
Dummy4 | result4 | 40 | 101.4 | 105 | N/A | N/A | N/A |
Dummy5 | result4 | 50 | N/A | N/A | 110.2 | N/A | N/A |
Any guidance is much appreciated.
Thanks
Upvotes: 0
Views: 912
Reputation: 35676
You can't compare the word 'max' to 100 via >=
(aggfunc = 'max' >= 100
):
I recommend not setting the fill value to a string, masking the DataFrame, to get rid of undesired rows, then replace with empty string via fillna
:
columns = ['col1', 'col2', 'col3', 'col4', 'col5', 'col6', 'col7']
res = df.pivot_table(values='col7', index=['col4', 'col5', 'col6'],
columns=['col2'], aggfunc='max')
col2 test1 test2 test3 test4 test5
col4 col5 col6
Dummy1 result1 10 102.2 100.0 102.0 NaN NaN
result2 10 106.2 NaN NaN NaN NaN
result6 10 101.1 NaN NaN NaN NaN
Dummy2 result2 20 101.2 103.0 87.0 120.0 NaN
Dummy3 result3 30 102.3 104.0 107.0 NaN NaN
Dummy4 result4 40 101.4 105.0 NaN NaN NaN
Dummy5 result4 50 NaN NaN 110.2 NaN NaN
Dummy6 result1 100 NaN NaN NaN NaN 88.0
Mask for any values where any
res >= 100
and fillna
:
res = res[(res >= 100).any(1)].fillna('')
col2 test1 test2 test3 test4 test5
col4 col5 col6
Dummy1 result1 10 102.2 100.0 102.0
result2 10 106.2
result6 10 101.1
Dummy2 result2 20 101.2 103.0 87.0 120.0
Dummy3 result3 30 102.3 104.0 107.0
Dummy4 result4 40 101.4 105.0
Dummy5 result4 50 110.2
Optional reset_index
to clear the MultiIndex and rename_axis
to clear the axis name:
res[(res >= 100).any(1)].fillna('').reset_index().rename_axis(None, axis=1)
col4 col5 col6 test1 test2 test3 test4 test5
0 Dummy1 result1 10 102.2 100.0 102.0
1 Dummy1 result2 10 106.2
2 Dummy1 result6 10 101.1
3 Dummy2 result2 20 101.2 103.0 87.0 120.0
4 Dummy3 result3 30 102.3 104.0 107.0
5 Dummy4 result4 40 101.4 105.0
6 Dummy5 result4 50 110.2
Complete Working Example:
import pandas as pd
df = pd.DataFrame({
'col1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16],
'col2': ['test1', 'test1', 'test1', 'test1', 'test2', 'test2', 'test2',
'test2', 'test3', 'test3', 'test3', 'test3', 'test4', 'test5',
'test1', 'test1'],
'col3': ['t1', 't1', 't1', 't1', 't1', 't1', 't1', 't1', 't1', 't1', 't1',
't1', 't1', 't1', 't1', 't1'],
'col4': ['Dummy1', 'Dummy2', 'Dummy3', 'Dummy4', 'Dummy1', 'Dummy2',
'Dummy3', 'Dummy4', 'Dummy1', 'Dummy2', 'Dummy3', 'Dummy5',
'Dummy2', 'Dummy6', 'Dummy1', 'Dummy1'],
'col5': ['result1', 'result2', 'result3', 'result4', 'result1', 'result2',
'result3', 'result4', 'result1', 'result2', 'result3', 'result4',
'result2', 'result1', 'result2', 'result6'],
'col6': [10, 20, 30, 40, 10, 20, 30, 40, 10, 20, 30, 50, 20, 100, 10, 10],
'col7': [102.2, 101.2, 102.3, 101.4, 100.0, 103.0, 104.0, 105.0, 102.0,
87.0, 107.0, 110.2, 120.0, 88.0, 106.2, 101.1]
})
columns = ['col1', 'col2', 'col3', 'col4', 'col5', 'col6', 'col7']
res = df.pivot_table(values='col7', index=['col4', 'col5', 'col6'],
columns=['col2'], aggfunc='max')
res = (
res[(res >= 100).any(1)].fillna('').reset_index().rename_axis(None, axis=1)
)
print(res)
To get the value without col5 remove it from the index
of the pivot_table
:
columns = ['col1', 'col2', 'col3', 'col4', 'col5', 'col6', 'col7']
res = df.pivot_table(values='col7', index=['col4', 'col6'],
columns=['col2'], aggfunc='max')
res = (
res[(res >= 100).any(1)].fillna('').reset_index().rename_axis(None, axis=1)
)
col4 col6 test1 test2 test3 test4 test5
0 Dummy1 10 106.2 100.0 102.0
1 Dummy2 20 101.2 103.0 87.0 120.0
2 Dummy3 30 102.3 104.0 107.0
3 Dummy4 40 101.4 105.0
4 Dummy5 50 110.2
Upvotes: 2
Reputation: 14949
Or you can try:
res = df.assign(col7 = df.col7.where(df.col7 > 100)).pivot_table(values='col7', index=['col4', 'col5', 'col6'],
columns=['col2'], aggfunc='max', fill_value= '')
Upvotes: 0