Python, Pandas: Add default for missing values

Question

Good morning,

right now my dataframe is looking something like this:

date        |    type    |    value
2021-01-01  |    extern  |    17
2021-01-01  |    intern  |    19.5
2021-01-02  |    extern  |    104
2021-01-02  |    intern  |    8
2021-01-03  |    extern  |    17.4
2021-01-04  |    intern  |    34

As you may see, the dates 2021-01-03 and 2021-01-04 are missing rows, because neither of both have extern and intern values. I wonder if it's possible to turn this dataframe into something like this:

date        |    type    |    value
2021-01-01  |    extern  |    17
2021-01-01  |    intern  |    19.5
2021-01-02  |    extern  |    104
2021-01-02  |    intern  |    8
2021-01-03  |    extern  |    17.4
2021-01-03  |    intern  |    0
2021-01-04  |    intern  |    34
2021-01-04  |    extern  |    0

Notes:

the amount of values in type is limited to - for example - these two (extern and intern). There is no other value that may pop up.

Hope someone might know a solution for this kind of situation!? Thanks for all our help and a great day!

sammywemmy · Accepted Answer

You could use the complete function from pyjanitor to expose the missing combinations; at the moment you have to install the latest development version from github:

# pip install pyjanitor
import janitor

df.complete(["date", "type"], fill_value={"value": 0})

     date      type     value
0   2021-01-01  extern  17.0
1   2021-01-01  intern  19.5
2   2021-01-02  extern  104.0
3   2021-01-02  intern  8.0
4   2021-01-03  extern  17.4
5   2021-01-03  intern  0.0
6   2021-01-04  extern  0.0
7   2021-01-04  intern  34.0

Using Pandas' only, we can create unique values for columns 'A' and 'B", build a new MultiIndex, then reindex the dataframe:

new_index = pd.MultiIndex.from_product(
[df.date.unique(), df.type.unique()], names=["date", "type"])

 new_index

 MultiIndex([('2021-01-01', 'extern'),
                ('2021-01-01', 'intern'),
                ('2021-01-02', 'extern'),
                ('2021-01-02', 'intern'),
                ('2021-01-03', 'extern'),
                ('2021-01-03', 'intern'),
                ('2021-01-04', 'extern'),
                ('2021-01-04', 'intern')],
           names=['date', 'type'])

Now, set index, reindex and reset index:

df.set_index(["date", "type"]).reindex(new_index, fill_value=0).reset_index()

    date    type    value
0   2021-01-01  extern  17.0
1   2021-01-01  intern  19.5
2   2021-01-02  extern  104.0
3   2021-01-02  intern  8.0
4   2021-01-03  extern  17.4
5   2021-01-03  intern  0.0
6   2021-01-04  extern  0.0
7   2021-01-04  intern  34.0

the complete function from pyjanitor is an abstraction of the above steps, with some more options, especially for nested combinations and completely new values.

Python, Pandas: Add default for missing values

Answers (2)

Related Questions