Reputation: 862
I have a huge pandas dataframe, shaped like this example:
new_id hour names values
0 0 mark 5
0 0 matt 4
0 0 alex 3
1 0 roger 2
1 0 arthur 7
1 1 alf 8
2 1 ale 6
3 1 peter 5
3 2 tom 2
4 2 andrew 7
I need to reshape it, so I use pivot_table()
:
dummy=dummy.pivot_table(index=['hour','new_id'],columns='name', values='values').fillna(0)
so it becomes
names ale alex alf andrew arthur mark matt peter roger tom
hour new_id
0 0 0.0 3.0 0.0 0.0 0.0 5.0 4.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 7.0 0.0 0.0 0.0 2.0 0.0
1 1 0.0 0.0 8.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 6.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 5.0 0.0 0.0
2 3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2.0
4 0.0 0.0 0.0 7.0 0.0 0.0 0.0 0.0 0.0 0.0
....
By the way, this small example can't reproduce my problem: in the real dataset, when I pivot it, I get some float values which shouldn't be there, since they are the aggregation and the sum of the values
of the initial dataset, which are all integers. Not only they are float, but they are also quite far from the exact results.
Why do I get these float values? Is there a better way to get what I want? I don't really want to write by myself a function to sum properly all the values before pivoting the dataframe, since this should be exactly what pivot_table()
does.
Upvotes: 8
Views: 4692
Reputation: 862771
There is problem NaN
s, which convert all values to float
s so possible solution is add parameter fill_value=0
if input data are integer
s:
dummy=dummy.pivot_table(index=['hour','new_id'],columns='name', values='values', fill_value=0)
print (dummy)
name ale alex alf andrew arthur mark matt peter roger tom
hour new_id
0 0 0 3 0 0 0 5 4 0 0 0
1 0 0 0 0 7 0 0 0 2 0
1 1 0 0 8 0 0 0 0 0 0 0
2 6 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 5 0 0
2 3 0 0 0 0 0 0 0 0 0 2
4 0 0 0 7 0 0 0 0 0 0
Default aggregate function in pivot_table
is mean
, so is expected at least one float value in output, so it convert all values to float
s.
So if change aggregate function to sum
all working nice:
dummy = dummy.pivot_table(index=['hour','new_id'],
columns='name',
values='values',
fill_value=0,
aggfunc='sum')
Upvotes: 11