Reputation: 1061
Looking to preserve NaN values when changing the shape of the dataframe.
These two questions may be related:
but not been able to use the answers provided - can I set a min count for np.sum somehow?
import pandas as pd
import numpy as np
df = pd.DataFrame([['Y1', np.nan], ['Y2', np.nan], ['Y1', 6], ['Y2',8]], columns=['A', 'B'], index=['1988-01-01','1988-01-01', '1988-01-04', '1988-01-04'])
df.index.name = 'Date'
df
pivot_df = pd.pivot_table(df, values='B', index=['Date'], columns=['A'],aggfunc=np.sum)
pivot_df
The output is:
A Y1 Y2
Date
1988-01-01 0.0 0.0
1988-01-04 6.0 8.0
and the desired output is:
A Y1 Y2
Date
1988-01-01 NaN NaN
1988-01-04 6.0 8.0
Upvotes: 6
Views: 12694
Reputation: 1144
I also wanted to keep NaN values, and I also wanted to keep using the pivot_table
function. So here is my solution:
pivot_df = pd.pivot_table(
df,
values='B',
index=['Date'],
columns=['A'],
aggfunc=lambda x: x.sum(min_count=1),
dropna=False
)
The downside is that this is less efficient in terms of computation time.
Upvotes: 0
Reputation: 51155
If you have no duplicate entries, use set_index
+ unstack
df.set_index('A', append=True)['B'].unstack(-1)
A Y1 Y2
Date
1988-01-01 NaN NaN
1988-01-04 6.0 8.0
If you have duplicates, use a groupby
with min_count
>> df
A B
Date
1988-01-01 Y1 NaN
1988-01-01 Y2 NaN
1988-01-04 Y1 6.0
1988-01-04 Y2 8.0
1988-01-01 Y1 NaN
1988-01-01 Y2 NaN
1988-01-04 Y1 6.0
1988-01-04 Y2 8.0
df.set_index('A', append=True).groupby(level=[0, 1])['B'].sum(min_count=1).unstack(-1)
A Y1 Y2
Date
1988-01-01 NaN NaN
1988-01-04 12.0 16.0
Upvotes: 1
Reputation: 148975
It is possible to count the values, and drop when 0 (or less than the expected count):
pivot_df = pd.pivot_table(df, values='B', index=['Date'], columns=['A'],
aggfunc=['sum','count'])
# build the mask from count
mask = (pivot_df.xs('count', axis=1) == 0) # or ...<min_limit
#build the actual pivot_df from sum
pivot_df = pivot_df.xs('sum', axis=1)
# and reset to NaN when not enough values
pivot_df[mask] = np.nan
It gives as expected:
A Y1 Y2
Date
1988-01-01 NaN NaN
1988-01-04 6.0 8.0
This one will give sensible result when you sum more than one value.
Upvotes: 1
Reputation: 1061
From the helpful comments the following solution meets my requirements:
pivot_df_2 = pd.pivot_table(df, values='B', index=['Date'], columns=['A'],aggfunc=min, dropna=False)
pivot_df_2
Values are supposed to be unique per slot so replacing the sum function with a min function shouldn't make a difference (in my case)
Upvotes: 3
Reputation: 150765
In this case, I would resolve by groupby
:
(df.groupby(['Date', 'A']).B
.apply(lambda x: np.nan if x.isna().all() else x.sum())
.unstack('A')
)
output:
A Y1 Y2
Date
1988-01-01 NaN NaN
1988-01-04 6.0 8.0
Change isna().all()
to isna().any()
if needed.
Upvotes: 1