FunnyChef
FunnyChef

Reputation: 1946

How can I find non-matching values, by date, from two separate dataframes?

How can I find non-matching values, by date, from two separate dataframes?

My dataframes look like:

import pandas as pd

data_1 = {'date':['2019-05-01','2019-05-02'],
          'id': ['1122','1133']}
data_2 = {'id': ['1122','1133','1144']}

df1 = pd.DataFrame(data_1, columns=['date','id'])
df2 = pd.DataFrame(data_2, columns=['id'])

The result I need have is a dataframe that contains any id from df2 that doesn't match df1 and the date from df1.

So the results should look like;

   date     id
2019-05-01 1133
2019-05-01 1144
2019-05-02 1122
2019-05-02 1144

Upvotes: 1

Views: 63

Answers (3)

piRSquared
piRSquared

Reputation: 294258

append and drop_duplicates

Create a starting dataframe consisting of a product of dates from df1 and ids from df2

d = pd.DataFrame([
    [d, i] for d in df1.date
    for i in df2.id
], columns=df1.columns)

d.append(df1).drop_duplicates(keep=False)

         date    id
1  2019-05-01  1133
2  2019-05-01  1144
3  2019-05-02  1122
5  2019-05-02  1144

Or we can skip the existing ones during the construction step

tups = {*zip(*map(df1.get, df1))}
pd.DataFrame([
    [d, i] for d in df1.date
    for i in df2.id if (d, i) not in tups
], columns=df1.columns)

         date    id
0  2019-05-01  1133
1  2019-05-01  1144
2  2019-05-02  1122
3  2019-05-02  1144

Or with itertools.product and some set logic

My favorite by the way!

from itertools import product

pd.DataFrame(
    {*product(df1.date, df2.id)} - {*zip(*map(df1.get, df1))},
    columns=df1.columns
)

         date    id
0  2019-05-01  1144
1  2019-05-01  1133
2  2019-05-02  1144
3  2019-05-02  1122

Upvotes: 1

BENY
BENY

Reputation: 323236

This is more like CJ(cross join) problem with filter

df1.assign(key=1).merge(df2.assign(key=1),on='key').loc[lambda x : x['id_x']!=x['id_y']].drop(['key','id_x'],1)
Out[262]: 
         date  id_y
1  2019-05-01  1133
2  2019-05-01  1144
3  2019-05-02  1122
5  2019-05-02  1144

Upvotes: 1

Quang Hoang
Quang Hoang

Reputation: 150745

Try:

# first we change `df1.id` to columns by crosstab:
u = pd.crosstab(df1.date, df1.id)

# extend the id with df2.id
u = u.reindex(df2['id'], axis=1, fill_value=0).stack()

Then what you are looking for is

u[u.eq(0)].index.to_frame().reset_index(drop=True)

Output:

          date  id
0   2019-05-01  1133
1   2019-05-01  1144
2   2019-05-02  1122
3   2019-05-02  1144

Upvotes: 0

Related Questions