finethen
finethen

Reputation: 433

Python, Pandas: Add default for missing values

Good morning,

right now my dataframe is looking something like this:

date        |    type    |    value
2021-01-01  |    extern  |    17
2021-01-01  |    intern  |    19.5
2021-01-02  |    extern  |    104
2021-01-02  |    intern  |    8
2021-01-03  |    extern  |    17.4
2021-01-04  |    intern  |    34

As you may see, the dates 2021-01-03 and 2021-01-04 are missing rows, because neither of both have extern and intern values. I wonder if it's possible to turn this dataframe into something like this:

date        |    type    |    value
2021-01-01  |    extern  |    17
2021-01-01  |    intern  |    19.5
2021-01-02  |    extern  |    104
2021-01-02  |    intern  |    8
2021-01-03  |    extern  |    17.4
2021-01-03  |    intern  |    0
2021-01-04  |    intern  |    34
2021-01-04  |    extern  |    0

Notes:

Hope someone might know a solution for this kind of situation!? Thanks for all our help and a great day!

Upvotes: 0

Views: 723

Answers (2)

sammywemmy
sammywemmy

Reputation: 28644

You could use the complete function from pyjanitor to expose the missing combinations; at the moment you have to install the latest development version from github:

# pip install pyjanitor
import janitor

df.complete(["date", "type"], fill_value={"value": 0})

     date      type     value
0   2021-01-01  extern  17.0
1   2021-01-01  intern  19.5
2   2021-01-02  extern  104.0
3   2021-01-02  intern  8.0
4   2021-01-03  extern  17.4
5   2021-01-03  intern  0.0
6   2021-01-04  extern  0.0
7   2021-01-04  intern  34.0

Using Pandas' only, we can create unique values for columns 'A' and 'B", build a new MultiIndex, then reindex the dataframe:

new_index = pd.MultiIndex.from_product(
[df.date.unique(), df.type.unique()], names=["date", "type"])

 new_index

 MultiIndex([('2021-01-01', 'extern'),
                ('2021-01-01', 'intern'),
                ('2021-01-02', 'extern'),
                ('2021-01-02', 'intern'),
                ('2021-01-03', 'extern'),
                ('2021-01-03', 'intern'),
                ('2021-01-04', 'extern'),
                ('2021-01-04', 'intern')],
           names=['date', 'type'])

Now, set index, reindex and reset index:

df.set_index(["date", "type"]).reindex(new_index, fill_value=0).reset_index()

    date    type    value
0   2021-01-01  extern  17.0
1   2021-01-01  intern  19.5
2   2021-01-02  extern  104.0
3   2021-01-02  intern  8.0
4   2021-01-03  extern  17.4
5   2021-01-03  intern  0.0
6   2021-01-04  extern  0.0
7   2021-01-04  intern  34.0

the complete function from pyjanitor is an abstraction of the above steps, with some more options, especially for nested combinations and completely new values.

Upvotes: 3

Soudipta Dutta
Soudipta Dutta

Reputation: 2122

For Huge datasets, Use : pd.factorize + np.repeat + np.tile + pd.merge

import pandas as pd
import numpy as np
# Create a list of dictionaries
data = [
    {"date": "1990-01-01", "type": "extern", "value": 17},
    {"date": "1990-01-01", "type": "intern", "value": 19.5},
    {"date": "1990-01-02", "type": "extern", "value": 104},
    {"date": "1990-01-02", "type": "intern", "value": 8},
    {"date": "1990-01-03", "type": "extern", "value": 17.4},
    {"date": "1990-01-04", "type": "intern", "value": 34},
]


df = pd.DataFrame(data)
print(df)
"""
         date    type  value
0  1990-01-01  extern   17.0
1  1990-01-01  intern   19.5
2  1990-01-02  extern  104.0
3  1990-01-02  intern    8.0
4  1990-01-03  extern   17.4
5  1990-01-04  intern   34.0
"""
date_codes,date_unique  = pd.factorize(df['date'])
type_codes,type_unique  = pd.factorize(df['type'])

date_repeated = np.repeat(date_unique, len(type_unique))
type_tiled    = np.tile(type_unique,len(date_unique))

cartesian_product = pd.DataFrame({
'date' : date_repeated, 'type' : type_tiled })

print(cartesian_product)
"""
        Date    Type
0  1990-01-01  extern
1  1990-01-01  intern
2  1990-01-02  extern
3  1990-01-02  intern
4  1990-01-03  extern
5  1990-01-03  intern
6  1990-01-04  extern
7  1990-01-04  intern
"""

merged_df = pd.merge(cartesian_product, df, on = ['date','type'],how = 'left').fillna(0)

print(merged_df)

"""
         date    type  value
0  1990-01-01  extern   17.0
1  1990-01-01  intern   19.5
2  1990-01-02  extern  104.0
3  1990-01-02  intern    8.0
4  1990-01-03  extern   17.4
5  1990-01-03  intern    0.0
6  1990-01-04  extern    0.0
7  1990-01-04  intern   34.0

"""

Upvotes: 0

Related Questions