vera

Reputation: 45

pandas merge on date-like columns does not work

First, the solution from pandas merge on date column issue didn't work in my code.

I have two data frames that come from MySQL query results; both of them have a 'captureDate' column. In the MySQL table the data type is 'date', but in the data frames the dtype is object.

df1['captureDate'] data

0    2017-06-28
1    2017-06-28
2    2017-06-28
3    2017-06-28
4    2017-06-28
5    2017-06-28
6    2017-06-28
Name: captureDate, dtype: object

df2['captureDate'] data

0    2017-06-28
1    2017-06-28
2    2017-06-28
3    2017-06-28
4    2017-06-28
5    2017-06-28
6    2017-06-28
Name: captureDate, dtype: object

When I compare the columns of df1 and df2, equals returns True:

print df1['captureDate'].equals(df2['captureDate'])

My merge code:

inner = pd.merge(df1, df2,  on='captureDate', how='inner')

But the result is wrong: it returned 49 rows. The info for inner is below:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 49 entries, 0 to 48
Data columns (total 20 columns):
rule_id_x          49 non-null int64
monitor_sites_x    49 non-null object
rule_type_x        49 non-null int64
lower_limit_x      49 non-null int64
upper_limit_x      49 non-null int64
actual_x           49 non-null int64
predict_x          49 non-null int64
captureDate        49 non-null object
deviation_x        49 non-null float32
create_time_x      49 non-null int64
actual_y           49 non-null int64
create_time_y      49 non-null int64
deviation_y        49 non-null object
id                 49 non-null int64
lower_limit_y      49 non-null int64
monitor_sites_y    49 non-null object
predict_y          49 non-null int64
rule_id_y          49 non-null object
rule_type_y        49 non-null int64
upper_limit_y      49 non-null int64

So why does this happen, and how can I handle this issue?

Upvotes: 2

Views: 4061

Answers (1)

jezrael

Reputation: 862601

Sample:

df1 = pd.DataFrame({'captureDate':['2017-06-22'] * 3 + ['2017-06-25'] * 3 + ['2017-06-28'] * 2,
                    'rule_id':[40,10,20,30,70,10,60,10]})
print (df1)
  captureDate  rule_id
0  2017-06-22       40
1  2017-06-22       10
2  2017-06-22       20
3  2017-06-25       30
4  2017-06-25       70
5  2017-06-25       10
6  2017-06-28       60
7  2017-06-28       10
df2 = pd.DataFrame({'captureDate':['2017-06-22'] * 3 + ['2017-06-25'] * 3 + ['2017-06-28'] * 2,
                    'rule_id':[1,2,3,4,5,6,7,8]})
print (df2)
  captureDate  rule_id
0  2017-06-22        1
1  2017-06-22        2
2  2017-06-22        3
3  2017-06-25        4
4  2017-06-25        5
5  2017-06-25        6
6  2017-06-28        7
7  2017-06-28        8

First convert the column to datetime with to_datetime:

df1['captureDate'] = pd.to_datetime(df1['captureDate'])
df2['captureDate']  = pd.to_datetime(df2['captureDate'])

The problem is duplicates in both key columns. merge pairs every matching row of df1 with every matching row of df2 (a Cartesian product per key value), so your 7 rows with the same date in each frame produce 7 × 7 = 49 rows:

print (df1['captureDate'].equals(df2['captureDate']))
True

inner = pd.merge(df1, df2,  on='captureDate', how='inner')
print (inner)
   captureDate  rule_id_x  rule_id_y
0   2017-06-22         40          1
1   2017-06-22         40          2
2   2017-06-22         40          3
3   2017-06-22         10          1
4   2017-06-22         10          2
5   2017-06-22         10          3
6   2017-06-22         20          1
7   2017-06-22         20          2
8   2017-06-22         20          3
9   2017-06-25         30          4
10  2017-06-25         30          5
11  2017-06-25         30          6
12  2017-06-25         70          4
13  2017-06-25         70          5
14  2017-06-25         70          6
15  2017-06-25         10          4
16  2017-06-25         10          5
17  2017-06-25         10          6
18  2017-06-28         60          7
19  2017-06-28         60          8
20  2017-06-28         10          7
21  2017-06-28         10          8

Possible solutions

Use concat with set_index, then flatten the MultiIndex columns with map and join:

df3 = pd.concat([df1.set_index('captureDate'), 
                 df2.set_index('captureDate')], 
                 axis=1, 
                 keys=('a', 'b'))
df3.columns = df3.columns.map('_'.join)
print (df3)
             a_rule_id  b_rule_id
captureDate                      
2017-06-22          40          1
2017-06-22          10          2
2017-06-22          20          3
2017-06-25          30          4
2017-06-25          70          5
2017-06-25          10          6
2017-06-28          60          7
2017-06-28          10          8

Or remove the duplicates with drop_duplicates, or aggregate the data by captureDate in both DataFrames (an aggregation sketch follows below):

df1 = df1.drop_duplicates('captureDate')
df2 = df2.drop_duplicates('captureDate')
print (df1)
  captureDate  rule_id
0  2017-06-22       40
3  2017-06-25       30
6  2017-06-28       60

print (df2)
  captureDate  rule_id
0  2017-06-22        1
3  2017-06-25        4
6  2017-06-28        7

inner = pd.merge(df1, df2,  on='captureDate', how='inner')
print (inner)
  captureDate  rule_id_x  rule_id_y
0  2017-06-22         40          1
1  2017-06-25         30          4
2  2017-06-28         60          7
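
The aggregation variant could look like this, e.g. taking the mean of rule_id per date (a sketch on the original sample df1 and df2, before drop_duplicates; any other aggregation works the same way):

# aggregate each frame to one row per captureDate, then merge on the now-unique key
df1_agg = df1.groupby('captureDate')['rule_id'].mean().reset_index()
df2_agg = df2.groupby('captureDate')['rule_id'].mean().reset_index()

inner = pd.merge(df1_agg, df2_agg, on='captureDate', how='inner')
print (inner)
  captureDate  rule_id_x  rule_id_y
0  2017-06-22  23.333333        2.0
1  2017-06-25  36.666667        5.0
2  2017-06-28  35.000000        7.5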

EDIT1:

You can use cumcount to number the duplicates within each captureDate group and then merge on both columns. Finally, remove the helper column new with drop:

df1 = pd.DataFrame({'captureDate':['2017-06-22']* 3 + ['2017-06-25']* 3 + ['2017-06-28'] * 2,
                   'rule_id':[40,10,20,30,70,10,60,10]})

df2 = pd.DataFrame({'captureDate':['2017-06-22'] * 3 + ['2017-06-25'] * 3,
                   'rule_id':[1,2,3,4,5,6]})


df1['new'] = df1.groupby('captureDate').cumcount()
df2['new'] = df2.groupby('captureDate').cumcount()
print (df1)
  captureDate  rule_id  new
0  2017-06-22       40    0
1  2017-06-22       10    1
2  2017-06-22       20    2
3  2017-06-25       30    0
4  2017-06-25       70    1
5  2017-06-25       10    2
6  2017-06-28       60    0
7  2017-06-28       10    1

print (df2)
  captureDate  rule_id  new
0  2017-06-22        1    0
1  2017-06-22        2    1
2  2017-06-22        3    2
3  2017-06-25        4    0
4  2017-06-25        5    1
5  2017-06-25        6    2

df3 = pd.merge(df1, df2, on=['captureDate','new']).drop('new', axis=1)
print (df3)
  captureDate  rule_id_x  rule_id_y
0  2017-06-22         40          1
1  2017-06-22         10          2
2  2017-06-22         20          3
3  2017-06-25         30          4
4  2017-06-25         70          5
5  2017-06-25         10          6
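
The two 2017-06-28 rows of df1 have no counterpart in df2 here, so the inner merge drops them. If they should be kept, a left merge is one possible variant (sketch); the missing rule_id_y become NaN:

# keep all rows of df1, filling unmatched rows from df2 with NaN
df4 = pd.merge(df1, df2, on=['captureDate','new'], how='left').drop('new', axis=1)
print (df4)
  captureDate  rule_id_x  rule_id_y
0  2017-06-22         40        1.0
1  2017-06-22         10        2.0
2  2017-06-22         20        3.0
3  2017-06-25         30        4.0
4  2017-06-25         70        5.0
5  2017-06-25         10        6.0
6  2017-06-28         60        NaN
7  2017-06-28         10        NaN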

Upvotes: 2
