scott martin
scott martin

Reputation: 1293

Pandas - Comparing two dataframes by date and find missing entires

I have a Dataframe with list of unique working dates in a month, called days.

I have another Dataframe that has the list of days an employee has reported to work, called reported.

I am trying to find days when an employee did not report to work.

Given below is sample view of the Dataframes:

days
11/1/2018
11/2/2018
11/3/2018
11/4/2018
11/5/2018
11/6/2018
11/7/2018

View of the Dataframe called reported:

emp_id, date
1001,11/1/2018
1001,11/2/2018
1002,11/1/2018
1002,11/3/2018
1003,11/3/2018

Expected output (that lists down days an employee did not report):

emp_id,date
1001,11/3/2018
1002,11/2/2018
1003,11/1/2018
1003,11/2/2018

Update I created the first Dataframe as below

import numpy as np
df_1 = np.sort(df_master.date.unique())
df_1_df = pd.DataFrame(df_1)
df_1_df.columns

RangeIndex(start=0, stop=1, step=1)

Upvotes: 1

Views: 747

Answers (1)

jezrael
jezrael

Reputation: 862681

You need merge with default inner join and then reindex by all unique values of emp_id and date, last merge with parameter indicator=True for filtering not reported rows:

reported['date'] = pd.to_datetime(reported['date'])

df1 = reported.set_index(['emp_id', 'date'])
mux = pd.MultiIndex.from_product(df1.index.levels, names=df1.index.names)

df3 = (df1.reindex(mux)
         .reset_index()
         .merge(reported, indicator=True, how='outer')
         .query('_merge != "both"')
         .drop('_merge', axis=1))
print (df3)
   emp_id       date
2    1001 2018-11-03
4    1002 2018-11-02
6    1003 2018-11-01
7    1003 2018-11-02

Another solution, thanks @IMCoins is filtering by isin with ~ for invert boolean mask:

df2 = df1.reindex(mux)
df3 = df2[~df2.index.isin(df1.index)].reset_index()
print (df3)

   emp_id       date
0    1001 2018-11-03
1    1002 2018-11-02
2    1003 2018-11-01
3    1003 2018-11-02

Upvotes: 2

Related Questions