Create symmetrical pandas GroupBy with desired values

Question

consider the following data.

I have some missing values on the data, which is,

3 data points on P1 - S1
3 data points on P1 - S3
3 data points on P2 - S4

Is there a GroupBy method that could fill up the data so it is symmetrical? Symmetrical in the sense of, it is grouped by ['Project', 'Stage'] so data that does not have that category expands by filling them up with predefined/calculated values (In this example, value is 0). To be more clear, please see the desired output at the bottom.

I browsed the docs but I got lost in the sea of docs. Would really appreciate some tips or directions! Thanks!

import pandas as pd
from io import StringIO
import numpy as np

data = """  x   y   Project Stage
0   -10.540327  61.649657999999995  P1  S2
1   32.379704   -130.71579599999998 P1  S2
2   -2.620516   71.260701   P1  S2
3   13.978546   31.711221000000002  P1  S4
4   41.391873   88.070636   P1  S4
5   -0.303671   -84.376145  P1  S4
6   -65.370553  -7.694756   P2  S3
7   9.062894    177.233938  P2  S3
8   17.029269   -100.364923 P2  S3
9   13.354712   73.78603100000001   P2  S2
10  18.781817999999998  -109.3554   P2  S2
11  67.189193   113.94348000000001  P2  S2
12  8.762746    128.735548  P2  S1
13  59.287251   11.017662   P2  S1
14  -2.4981430000000002 -166.897776 P2  S1
15  1.1900870000000001  -82.309501  P3  S3
16  21.938575   -118.20861699999999 P3  S3
17  45.414544   122.49246000000001  P3  S3
18  -36.322626  -63.310702  P3  S2
19  53.568994999999994  141.37641100000002  P3  S2
20  26.129929999999998  153.706601  P3  S2
21  -3.9384330000000003 102.90164399999999  P3  S6
22  -8.198288   137.616797  P3  S6
23  -8.240618   -146.80676  P3  S6
24  15.193395   25.697023   P3  S5
25  20.886185   68.356337   P3  S5
26  -22.101129999999998 89.639481   P3  S5"""

df = pd.read_csv(StringIO(data), sep='	', index_col=0)

df.groupby(['Project', 'Stage']).count()

It returns:

My current workaround

def create_df(p, s):
    return pd.DataFrame({
        'Project': [p] * 3,
        'Stage': [s] * 3
    })

df.append([
    create_df('P1', 'S1'),
    create_df('P1', 'S3'),
    create_df('P2', 'S4')
]).groupby(['Project', 'Stage']).count()

Is there an elegant way in pandas to fill up other data so it resulted as this:

David Erickson · Accepted Answer

Create iterable arrays for the two columns you are grouping by with (see i below)
Do your groupby that you did in your question
Use .reindex(index=pd.MultiIndex.from_product() on your dataframe and pass i from step 1 and the two columns you have grouped by. Finally, fill with 0 and use .sort_values by the relevant columns.

code:

i = [df['Project'].unique(), df['Stage'].unique()]
df = df.groupby(['Project', 'Stage']).count()
df = df.reindex(index=pd.MultiIndex.from_product(i, names=['Project', 'Stage']),
              fill_value=0).reset_index().sort_values(['Project', 'Stage'])
df

output:

    Project Stage   x   y
3   P1      S1      0   0
0   P1      S2      3   3
2   P1      S3      0   0
1   P1      S4      3   3
5   P1      S5      0   0
4   P1      S6      0   0
9   P2      S1      3   3
6   P2      S2      3   3
8   P2      S3      3   3
7   P2      S4      0   0
11  P2      S5      0   0
10  P2      S6      0   0
15  P3      S1      0   0
12  P3      S2      3   3
14  P3      S3      3   3
13  P3      S4      0   0
17  P3      S5      3   3
16  P3      S6      3   3

Create symmetrical pandas GroupBy with desired values

It returns:

My current workaround

Is there an elegant way in pandas to fill up other data so it resulted as this:

Answers (1)

Related Questions