watchtower
watchtower

Reputation: 4298

Including missing combination of values based on a group of grouped data

I am expanding on earlier thread: Including missing combinations of values in a pandas groupby aggregation

In above thread, the accepted answer computes all possible combinations for the grouping variable. In this version, I'd like to compute combinations based on group of groups.

Let's take an example.

Here's input dataframe:

enter image description here

Here, one group is [Year,Quarter] i.e.

Year  Quarter
2014  Q1
2015  Q2
2015  Q3

Another set of group is Name:

Name
Adam
Smith

Now, I want to apply groupby and sum such that missing values of the combination of above groups is detected as NaN

Here's sample output:

enter image description here

I'd appreciate any help.


Here's sample input and output in dict format:

input=
{'Year': {0: 2014, 1: 2014, 2: 2015, 3: 2015, 4: 2015},
 'Quarter': {0: 'Q1', 1: 'Q1', 2: 'Q2', 3: 'Q2', 4: 'Q3'},
 'Name': {0: 'Adam', 1: 'Smith', 2: 'Adam', 3: 'Adam', 4: 'Smith'},
 'Value': {0: 2, 1: 3, 2: 4, 3: 5, 4: 5}}

output=
{'Year': {0: 2014, 1: 2014, 2: 2015, 3: 2015, 4: 2015, 5: 2015},
 'Quarter': {0: 'Q1', 1: 'Q1', 2: 'Q2', 3: 'Q2', 4: 'Q3', 5: 'Q3'},
 'Name': {0: 'Adam', 1: 'Smith', 2: 'Adam', 3: 'Smith', 4: 'Smith', 5: 'Adam'},
 'Value': {0: 2.0, 1: 3.0, 2: 9.0, 3: nan, 4: 5.0, 5: nan}}

Clarification:

I am looking for a method without doing melt and cast. i.e. without playing around with long and wide format.

Upvotes: 1

Views: 294

Answers (2)

BENY
BENY

Reputation: 323356

Using pivot_table, PS you can add reset_index at the end

df.pivot_table(index=['Year','Quarter'],columns='Name',values='Value',aggfunc='sum').stack(dropna=False)
Year  Quarter  Name 
2014  Q1       Adam     2.0
               Smith    3.0
2015  Q2       Adam     9.0
               Smith    NaN
      Q3       Adam     NaN
               Smith    5.0
dtype: float64

Upvotes: 1

It_is_Chris
It_is_Chris

Reputation: 14113

The example post you posted is the correct answer: groupby get the sum then unstack to find the missing value then stack with the param dropna=False here are the docs on stack

df.groupby(['Year','Quarter','Name']).sum().unstack().stack(dropna=False).reset_index()

    Year    Quarter   Name   Value
0   2014    Q1        Adam   2.0
1   2014    Q1        Smith  3.0
2   2015    Q2        Adam   9.0
3   2015    Q2        Smith  NaN
4   2015    Q3        Adam   NaN
5   2015    Q3        Smith  5.0

Upvotes: 1

Related Questions