Patricia
Patricia

Reputation: 45

Pandas join/merge 2 dataframes using date as index

I have 2 big dataframes with date as index. To simplyfy an example let's say that they look like this (number of data in particular date in the first dataframe is not the same as in second):

df1:

      Date    X    Y
2000-01-01   x1   y1
2000-01-01   x2   y2
2000-01-02   x3   y3
2000-01-03   x4   y4
2000-01-03   x5   y5
2000-01-03   x6   y6

df2:

      Date  X_2  Y_2
2000-01-01   X1   Y1
2000-01-01   X2   Y2
2000-01-01   X3   Y3
2000-01-03   X4   Y4
2000-01-03   X5   Y5

The output should look like this (I want merge only data with dates whitch appear in both dataframes):

      Date    X    Y  X_2  Y_2
2000-01-01   x1   y1   X1   Y1
2000-01-01   x2   y2   X2   Y2
2000-01-01  NaN  NaN   X3   Y3
2000-01-03   x4   y4   X4   Y4
2000-01-03   x5   y5   X5   Y5
2000-01-03   x6   y6  NaN  NaN

I've tried different code combinations and I keep getting duplicated data like this:

      Date    X    Y  X_2  Y_2
2000-01-01   x1   y1   X1   Y1
2000-01-01   x1   y1   X2   Y2
2000-01-01   x1   y1   X3   Y3
2000-01-01   x2   y2   X1   Y1
2000-01-01   x2   y2   X2   Y2
2000-01-01   x2   y2   X3   Y3

I've tried e.g. result = pd.merge(df1,df2, how='inner', on='Date') What to do in order to get the result I want?

Upvotes: 3

Views: 142

Answers (1)

unutbu
unutbu

Reputation: 879103

Use cumcount to number the items in each group, when grouped by Date:

In [107]: df1['count'] = df1.groupby('Date').cumcount()

In [108]: df1
Out[108]: 
         Date   X   Y  count
0  2000-01-01  x1  y1      0
1  2000-01-01  x2  y2      1
2  2000-01-02  x3  y3      0
3  2000-01-03  x4  y4      0
4  2000-01-03  x5  y5      1
5  2000-01-03  x6  y6      2

In [109]: df2['count'] = df2.groupby('Date').cumcount()

In [110]: df2
Out[110]: 
         Date X_2 Y_2  count
0  2000-01-01  X1  Y1      0
1  2000-01-01  X2  Y2      1
2  2000-01-01  X3  Y3      2
3  2000-01-03  X4  Y4      0
4  2000-01-03  X5  Y5      1

By adding the count column, you can now merge on both Date and count which gets you close to the result you want:

In [111]: pd.merge(df1, df2, on=['Date', 'count'], how='outer')
Out[111]: 
         Date    X    Y  count  X_2  Y_2
0  2000-01-01   x1   y1      0   X1   Y1
1  2000-01-01   x2   y2      1   X2   Y2
2  2000-01-02   x3   y3      0  NaN  NaN
3  2000-01-03   x4   y4      0   X4   Y4
4  2000-01-03   x5   y5      1   X5   Y5
5  2000-01-03   x6   y6      2  NaN  NaN
6  2000-01-01  NaN  NaN      2   X3   Y3

The rows that you wish to remove can be characterized as those where count equals 0 and X or X_2 equals NaN. Therefore, you could remove those rows with a boolean mask like this:

mask = (result['count'] == 0) & pd.isnull(result).any(axis=1)
result = result.loc[~mask]

import pandas as pd

df1 = pd.DataFrame({'Date': ['2000-01-01',
  '2000-01-01',
  '2000-01-02',
  '2000-01-03',
  '2000-01-03',
  '2000-01-03'],
 'X': ['x1', 'x2', 'x3', 'x4', 'x5', 'x6'],
 'Y': ['y1', 'y2', 'y3', 'y4', 'y5', 'y6']})

df2 = pd.DataFrame({'Date': ['2000-01-01',
  '2000-01-01',
  '2000-01-01',
  '2000-01-03',
  '2000-01-03'],
 'X_2': ['X1', 'X2', 'X3', 'X4', 'X5'],
 'Y_2': ['Y1', 'Y2', 'Y3', 'Y4', 'Y5']})


df1['count'] = df1.groupby('Date').cumcount()
df2['count'] = df2.groupby('Date').cumcount()
result = pd.merge(df1, df2, on=['Date', 'count'], how='outer')
mask = (result['count'] == 0) & pd.isnull(result).any(axis=1)
result = result.loc[~mask]
result = result.drop('count', axis=1)

yields

         Date    X    Y  count  X_2  Y_2
0  2000-01-01   x1   y1      0   X1   Y1
1  2000-01-01   x2   y2      1   X2   Y2
3  2000-01-03   x4   y4      0   X4   Y4
4  2000-01-03   x5   y5      1   X5   Y5
5  2000-01-03   x6   y6      2  NaN  NaN
6  2000-01-01  NaN  NaN      2   X3   Y3

Another way to restrict the merge to only those dates which are common to both df1 and df2 would be find the intersection of df1['Date'] and df2['Date'] first, and then apply pd.merge to sub-DataFrames of df1 and df2 which contain only those dates:

import numpy as np
dates = np.intersect1d(df1['Date'], df2['Date'])
mask1 = df1['Date'].isin(dates)
mask2 = df2['Date'].isin(dates)
result = pd.merge(df1.loc[mask1], df2.loc[mask2], on=['Date', 'count'], how='outer')

Upvotes: 4

Related Questions