Reputation: 45
I have 2 big dataframes with date as the index. To simplify, let's say they look like this (the number of rows for a particular date in the first dataframe is not the same as in the second):
df1:
Date X Y
2000-01-01 x1 y1
2000-01-01 x2 y2
2000-01-02 x3 y3
2000-01-03 x4 y4
2000-01-03 x5 y5
2000-01-03 x6 y6
df2:
Date X_2 Y_2
2000-01-01 X1 Y1
2000-01-01 X2 Y2
2000-01-01 X3 Y3
2000-01-03 X4 Y4
2000-01-03 X5 Y5
The output should look like this (I want to merge only the data for dates which appear in both dataframes):
Date X Y X_2 Y_2
2000-01-01 x1 y1 X1 Y1
2000-01-01 x2 y2 X2 Y2
2000-01-01 NaN NaN X3 Y3
2000-01-03 x4 y4 X4 Y4
2000-01-03 x5 y5 X5 Y5
2000-01-03 x6 y6 NaN NaN
I've tried different code combinations and I keep getting duplicated data like this:
Date X Y X_2 Y_2
2000-01-01 x1 y1 X1 Y1
2000-01-01 x1 y1 X2 Y2
2000-01-01 x1 y1 X3 Y3
2000-01-01 x2 y2 X1 Y1
2000-01-01 x2 y2 X2 Y2
2000-01-01 x2 y2 X3 Y3
I've tried e.g. result = pd.merge(df1, df2, how='inner', on='Date')
What should I do to get the result I want?
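Here is roughly what I'm doing, in a runnable form (the frames are built from the placeholder values in the tables above):
import pandas as pd
df1 = pd.DataFrame({'Date': ['2000-01-01', '2000-01-01', '2000-01-02',
                             '2000-01-03', '2000-01-03', '2000-01-03'],
                    'X': ['x1', 'x2', 'x3', 'x4', 'x5', 'x6'],
                    'Y': ['y1', 'y2', 'y3', 'y4', 'y5', 'y6']})
df2 = pd.DataFrame({'Date': ['2000-01-01', '2000-01-01', '2000-01-01',
                             '2000-01-03', '2000-01-03'],
                    'X_2': ['X1', 'X2', 'X3', 'X4', 'X5'],
                    'Y_2': ['Y1', 'Y2', 'Y3', 'Y4', 'Y5']})
# Merging on Date alone pairs every df1 row of a given date with every
# df2 row of that date, which is where the duplicated rows come from.
result = pd.merge(df1, df2, how='inner', on='Date')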
Upvotes: 3
Views: 142
Reputation: 879103
Use cumcount to number the items in each group when grouped by Date:
In [107]: df1['count'] = df1.groupby('Date').cumcount()
In [108]: df1
Out[108]:
Date X Y count
0 2000-01-01 x1 y1 0
1 2000-01-01 x2 y2 1
2 2000-01-02 x3 y3 0
3 2000-01-03 x4 y4 0
4 2000-01-03 x5 y5 1
5 2000-01-03 x6 y6 2
In [109]: df2['count'] = df2.groupby('Date').cumcount()
In [110]: df2
Out[110]:
Date X_2 Y_2 count
0 2000-01-01 X1 Y1 0
1 2000-01-01 X2 Y2 1
2 2000-01-01 X3 Y3 2
3 2000-01-03 X4 Y4 0
4 2000-01-03 X5 Y5 1
By adding the count column, you can now merge on both Date and count, which gets you close to the result you want:
In [111]: pd.merge(df1, df2, on=['Date', 'count'], how='outer')
Out[111]:
Date X Y count X_2 Y_2
0 2000-01-01 x1 y1 0 X1 Y1
1 2000-01-01 x2 y2 1 X2 Y2
2 2000-01-02 x3 y3 0 NaN NaN
3 2000-01-03 x4 y4 0 X4 Y4
4 2000-01-03 x5 y5 1 X5 Y5
5 2000-01-03 x6 y6 2 NaN NaN
6 2000-01-01 NaN NaN 2 X3 Y3
The rows that you wish to remove are those for dates which appear in only one dataframe. In the merged result those are exactly the rows where count equals 0 and X or X_2 is NaN, since for a date present in both dataframes the count-0 row is always fully matched. Therefore, you can remove those rows with a boolean mask like this:
mask = (result['count'] == 0) & pd.isnull(result).any(axis=1)
result = result.loc[~mask]
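Putting it all together, a complete script looks like this: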
import pandas as pd
df1 = pd.DataFrame({'Date': ['2000-01-01', '2000-01-01', '2000-01-02',
                             '2000-01-03', '2000-01-03', '2000-01-03'],
                    'X': ['x1', 'x2', 'x3', 'x4', 'x5', 'x6'],
                    'Y': ['y1', 'y2', 'y3', 'y4', 'y5', 'y6']})
df2 = pd.DataFrame({'Date': ['2000-01-01', '2000-01-01', '2000-01-01',
                             '2000-01-03', '2000-01-03'],
                    'X_2': ['X1', 'X2', 'X3', 'X4', 'X5'],
                    'Y_2': ['Y1', 'Y2', 'Y3', 'Y4', 'Y5']})
df1['count'] = df1.groupby('Date').cumcount()
df2['count'] = df2.groupby('Date').cumcount()
result = pd.merge(df1, df2, on=['Date', 'count'], how='outer')
mask = (result['count'] == 0) & pd.isnull(result).any(axis=1)
result = result.loc[~mask]
result = result.drop('count', axis=1)
yields
Date X Y X_2 Y_2
0 2000-01-01 x1 y1 X1 Y1
1 2000-01-01 x2 y2 X2 Y2
3 2000-01-03 x4 y4 X4 Y4
4 2000-01-03 x5 y5 X5 Y5
5 2000-01-03 x6 y6 NaN NaN
6 2000-01-01 NaN NaN X3 Y3
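The rows are not in date order because the outer merge appends the unmatched right-only row at the end. If you want the ordering from your desired output, a stable sort on Date afterwards should do it (a small sketch, assuming a pandas version that has sort_values):
result = result.sort_values('Date', kind='mergesort').reset_index(drop=True)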
Another way to restrict the merge to only those dates which are common to both df1 and df2 would be to find the intersection of df1['Date'] and df2['Date'] first, and then apply pd.merge to sub-DataFrames of df1 and df2 which contain only those dates:
import numpy as np
dates = np.intersect1d(df1['Date'], df2['Date'])
mask1 = df1['Date'].isin(dates)
mask2 = df2['Date'].isin(dates)
result = pd.merge(df1.loc[mask1], df2.loc[mask2], on=['Date', 'count'], how='outer')
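Note that with this approach the boolean-mask cleanup from above is not needed, because dates that appear in only one dataframe are filtered out before the merge; only the helper column has to be dropped (reusing the result variable from the snippet above):
result = result.drop('count', axis=1)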
Upvotes: 4