Reputation: 984

How can I add missing rows to a pandas DataFrame depending on a date and a combination of two columns?

I have a scenario in which I have a dataframe which contains 4 columns:

date, product, store, sales_amt

1/1/2019, A,A,200

1/1/2019,A,B,120

1/2/2019, A,A,75

1/3/2019,A,A,69

1/3/2019,A,B,23

--
--
--

1/31/2019,A,B,49

The dates are supposed to span a whole month (e.g. in this case, January 2019) but there are some missing days in the dataframe.

Does anyone have any tips on Python code that can loop through the dates for a particular month and add a new row to the dataframe with the missing date, product/store combination and a sales_amt of zero?

For example, there is no entry for the product/store combination of A/B on 1/2/2019.

The goal at the end is to have an entry for every day of that month for every product/store combination.

How can I do this?

Upvotes: 2

Answers (3)

Haleemur Ali

Reputation: 28233

how can i still ensure that i have a a date entry for each store/product entry?

The trick I'd use is to pivot the column combination for which I missing values are desired, then resample to generate the missing dates, fill the na with 0 & finally reshape back to original shape & reset_index

Here's a short script demonstrating this:

import pandas as pd
import numpy as np

products = ['tablet', 'laptop', 'phone']
stores = ['downtown', 'subburb', 'supermall']

date_range = pd.date_range('2019-01-01', '2019-03-31')

# create a sample data frame
df = pd.DataFrame({
  'date': date_range,
  'product': np.random.choice(products, len(date_range)),
  'store': np.random.choice(stores, len(date_range)),
  'sales_amt': np.random.normal(50, 10, len(date_range))
})
# remove some dates
df = df[~df.date.isin(['2019-01-10', '2019-01-11', '2019-02-07'])]

# set date as index, pivot product & store, fill na with 0
# reindex & unpivot

shape2 = df.set_index(['date', 'product', 'store']).\
  unstack([1, 2]).\
  resample('D').asfreq().\
  fillna(0).\
  stack([1, 2]).\
  reset_index()

print('%d unique dates in original df' % df['date'].nunique())
print('%d rows in original df' % len(df))
print('%d unique dates after filling missing values' % shape2['date'].nunique())
print('%d rows in after filling missing values' % len(shape2))

shape2.head()

It is also possible to do this through a left join, where first a dataframe of all the desired combinations is created, and then the source data frame is left joined. This approach takes marginally longer to express, but would be more intuitive for folks coming from an SQL mind-set.

cross_product = pd.MultiIndex.from_product([
    pd.date_range(df.date.min(), df.date.max()),
    df['product'].unique(),
    df.store.unique()]
    ).to_frame().\
    reset_index(drop=True).\
    rename({0:'date', 1:'product', 2:'store'}, axis=1)

final_df = cross_product.merge(df, 
    left_on=['date', 'product', 'store'], 
    right_on=['date', 'product', 'store'], 
    how='left').fillna(0)

Upvotes: 1

Akash sharma

Reputation: 141

I believe this can be done by creating an index of dates for entire month (in your case all dates between the first date and the last date) and then left merging it on the original dataframe.

So, if 'df' is the original dataframe containing date, product, store and sales_amt then adding following code should help.

all_dates = pd.DataFrame(pd.date_range(df['date'].min(), df['date'].max()))
all_dates.rename(columns = {0: 'date'}, inplace = True)
df = all_dates.merge(df, on = 'date', how = 'left')

Upvotes: 0

Scott Boston

Reputation: 153460

Use resample after set_index:

#create a dummy dataframe with data every other day
s=pd.date_range('2019-01-01', '2019-05-01', freq='2D')
df = pd.DataFrame({'Date':s, 'sales_amt':np.random.randint(100,1000,61)})

df.set_index('Date').resample('D').asfreq().fillna(0)

Output:

            sales_amt
Date                 
2019-01-01      996.0
2019-01-02        0.0
2019-01-03      236.0
2019-01-04        0.0
2019-01-05      225.0
...               ...
2019-04-27      444.0
2019-04-28        0.0
2019-04-29      756.0
2019-04-30        0.0
2019-05-01      641.0

Upvotes: 3

How can I add missing rows to a pandas DataFrame depending on a date and a combination of two columns?

Answers (3)

Related Questions