Reputation: 433
Good morning,
right now my dataframe is looking something like this:
date | type | value
2021-01-01 | extern | 17
2021-01-01 | intern | 19.5
2021-01-02 | extern | 104
2021-01-02 | intern | 8
2021-01-03 | extern | 17.4
2021-01-04 | intern | 34
As you may see, the dates 2021-01-03 and 2021-01-04 are missing rows, because neither of both have extern and intern values. I wonder if it's possible to turn this dataframe into something like this:
date | type | value
2021-01-01 | extern | 17
2021-01-01 | intern | 19.5
2021-01-02 | extern | 104
2021-01-02 | intern | 8
2021-01-03 | extern | 17.4
2021-01-03 | intern | 0
2021-01-04 | intern | 34
2021-01-04 | extern | 0
Notes:
Hope someone might know a solution for this kind of situation!? Thanks for all our help and a great day!
Upvotes: 0
Views: 723
Reputation: 28644
You could use the complete function from pyjanitor to expose the missing combinations; at the moment you have to install the latest development version from github:
# pip install pyjanitor
import janitor
df.complete(["date", "type"], fill_value={"value": 0})
date type value
0 2021-01-01 extern 17.0
1 2021-01-01 intern 19.5
2 2021-01-02 extern 104.0
3 2021-01-02 intern 8.0
4 2021-01-03 extern 17.4
5 2021-01-03 intern 0.0
6 2021-01-04 extern 0.0
7 2021-01-04 intern 34.0
Using Pandas' only, we can create unique values for columns 'A' and 'B", build a new MultiIndex, then reindex the dataframe:
new_index = pd.MultiIndex.from_product(
[df.date.unique(), df.type.unique()], names=["date", "type"])
new_index
MultiIndex([('2021-01-01', 'extern'),
('2021-01-01', 'intern'),
('2021-01-02', 'extern'),
('2021-01-02', 'intern'),
('2021-01-03', 'extern'),
('2021-01-03', 'intern'),
('2021-01-04', 'extern'),
('2021-01-04', 'intern')],
names=['date', 'type'])
Now, set index, reindex and reset index:
df.set_index(["date", "type"]).reindex(new_index, fill_value=0).reset_index()
date type value
0 2021-01-01 extern 17.0
1 2021-01-01 intern 19.5
2 2021-01-02 extern 104.0
3 2021-01-02 intern 8.0
4 2021-01-03 extern 17.4
5 2021-01-03 intern 0.0
6 2021-01-04 extern 0.0
7 2021-01-04 intern 34.0
the complete function from pyjanitor is an abstraction of the above steps, with some more options, especially for nested combinations and completely new values.
Upvotes: 3
Reputation: 2122
For Huge datasets, Use : pd.factorize + np.repeat + np.tile + pd.merge
import pandas as pd
import numpy as np
# Create a list of dictionaries
data = [
{"date": "1990-01-01", "type": "extern", "value": 17},
{"date": "1990-01-01", "type": "intern", "value": 19.5},
{"date": "1990-01-02", "type": "extern", "value": 104},
{"date": "1990-01-02", "type": "intern", "value": 8},
{"date": "1990-01-03", "type": "extern", "value": 17.4},
{"date": "1990-01-04", "type": "intern", "value": 34},
]
df = pd.DataFrame(data)
print(df)
"""
date type value
0 1990-01-01 extern 17.0
1 1990-01-01 intern 19.5
2 1990-01-02 extern 104.0
3 1990-01-02 intern 8.0
4 1990-01-03 extern 17.4
5 1990-01-04 intern 34.0
"""
date_codes,date_unique = pd.factorize(df['date'])
type_codes,type_unique = pd.factorize(df['type'])
date_repeated = np.repeat(date_unique, len(type_unique))
type_tiled = np.tile(type_unique,len(date_unique))
cartesian_product = pd.DataFrame({
'date' : date_repeated, 'type' : type_tiled })
print(cartesian_product)
"""
Date Type
0 1990-01-01 extern
1 1990-01-01 intern
2 1990-01-02 extern
3 1990-01-02 intern
4 1990-01-03 extern
5 1990-01-03 intern
6 1990-01-04 extern
7 1990-01-04 intern
"""
merged_df = pd.merge(cartesian_product, df, on = ['date','type'],how = 'left').fillna(0)
print(merged_df)
"""
date type value
0 1990-01-01 extern 17.0
1 1990-01-01 intern 19.5
2 1990-01-02 extern 104.0
3 1990-01-02 intern 8.0
4 1990-01-03 extern 17.4
5 1990-01-03 intern 0.0
6 1990-01-04 extern 0.0
7 1990-01-04 intern 34.0
"""
Upvotes: 0