Reputation: 45
pandas merge on date column issue
First, the solution didn't work in my code.
I have two data frames that come from MySQL query results; both of them have a 'captureDate' column. In the MySQL table the data type is 'date'; in the data frames the dtype is object.
df1['captureDate'] data
0 2017-06-28
1 2017-06-28
2 2017-06-28
3 2017-06-28
4 2017-06-28
5 2017-06-28
6 2017-06-28
Name: captureDate, dtype: object
df2['captureDate'] data
0 2017-06-28
1 2017-06-28
2 2017-06-28
3 2017-06-28
4 2017-06-28
5 2017-06-28
6 2017-06-28
Name: captureDate, dtype: object
When I compare the columns of df1 and df2, it returns True:
print df1['captureDate'].equals(df2['captureDate'])
My merge code:
inner = pd.merge(df1, df2, on='captureDate', how='inner')
But the result is wrong: it returned 49 rows. The info of inner is below:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 49 entries, 0 to 48
Data columns (total 20 columns):
rule_id_x 49 non-null int64
monitor_sites_x 49 non-null object
rule_type_x 49 non-null int64
lower_limit_x 49 non-null int64
upper_limit_x 49 non-null int64
actual_x 49 non-null int64
predict_x 49 non-null int64
captureDate 49 non-null object
deviation_x 49 non-null float32
create_time_x 49 non-null int64
actual_y 49 non-null int64
create_time_y 49 non-null int64
deviation_y 49 non-null object
id 49 non-null int64
lower_limit_y 49 non-null int64
monitor_sites_y 49 non-null object
predict_y 49 non-null int64
rule_id_y 49 non-null object
rule_type_y 49 non-null int64
upper_limit_y 49 non-null int64
So why does this happen, and how can I handle this issue?
Upvotes: 2
Views: 4061
Reputation: 862601
Sample:
df1 = pd.DataFrame({'captureDate':['2017-06-22'] *3 +['2017-06-25'] * 3 +['2017-06-28'] * 2,
'rule_id':[40,10,20,30,70,10,60,10]})
print (df1)
captureDate rule_id
0 2017-06-22 40
1 2017-06-22 10
2 2017-06-22 20
3 2017-06-25 30
4 2017-06-25 70
5 2017-06-25 10
6 2017-06-28 60
7 2017-06-28 10
df2 = pd.DataFrame({'captureDate':['2017-06-22'] *3 +['2017-06-25'] * 3 +['2017-06-28'] * 2,
'rule_id':[1,2,3,4,5,6,7,8]})
print (df2)
captureDate rule_id
0 2017-06-22 1
1 2017-06-22 2
2 2017-06-22 3
3 2017-06-25 4
4 2017-06-25 5
5 2017-06-25 6
6 2017-06-28 7
7 2017-06-28 8
First convert to datetime by to_datetime:
df1['captureDate'] = pd.to_datetime(df1['captureDate'])
df2['captureDate'] = pd.to_datetime(df2['captureDate'])
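To confirm the conversion took effect, a quick check you can add here (not part of the original answer):
print (df1['captureDate'].dtype)
datetime64[ns]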
The problem is duplicates in both key columns; merge matches every row of df1 with every row of df2 that has the same captureDate:
print (df1['captureDate'].equals(df2['captureDate']))
True
inner = pd.merge(df1, df2, on='captureDate', how='inner')
print (inner)
captureDate rule_id_x rule_id_y
0 2017-06-22 40 1
1 2017-06-22 40 2
2 2017-06-22 40 3
3 2017-06-22 10 1
4 2017-06-22 10 2
5 2017-06-22 10 3
6 2017-06-22 20 1
7 2017-06-22 20 2
8 2017-06-22 20 3
9 2017-06-25 30 4
10 2017-06-25 30 5
11 2017-06-25 30 6
12 2017-06-25 70 4
13 2017-06-25 70 5
14 2017-06-25 70 6
15 2017-06-25 10 4
16 2017-06-25 10 5
17 2017-06-25 10 6
18 2017-06-28 60 7
19 2017-06-28 60 8
20 2017-06-28 10 7
21 2017-06-28 10 8
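So the merged length is the sum over each date of (rows in df1 with that date) * (rows in df2 with that date): 3*3 + 3*3 + 2*2 = 22 rows in the sample, and 7*7 = 49 rows in the question, where all rows share one date. A small check on the sample frames (multiplying the two value_counts Series aligns them by date; this check is an addition, not from the original answer):
print ((df1['captureDate'].value_counts() * df2['captureDate'].value_counts()).sum())
22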
Possible solutions
Use concat with set_index, then flatten the MultiIndex by map and join:
df3 = pd.concat([df1.set_index('captureDate'),
df2.set_index('captureDate')],
axis=1,
keys=('a', 'b'))
df3.columns = df3.columns.map('_'.join)
print (df3)
a_rule_id b_rule_id
captureDate
2017-06-22 40 1
2017-06-22 10 2
2017-06-22 20 3
2017-06-25 30 4
2017-06-25 70 5
2017-06-25 10 6
2017-06-28 60 7
2017-06-28 10 8
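If you need captureDate back as an ordinary column afterwards, you can reset the index (a small follow-up, not shown in the original answer):
df3 = df3.reset_index()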
Or remove duplicates by drop_duplicates, or aggregate the data by captureDate in both DataFrames (a possible groupby sketch follows the example below):
df1 = df1.drop_duplicates('captureDate')
df2 = df2.drop_duplicates('captureDate')
print (df1)
captureDate rule_id
0 2017-06-22 40
3 2017-06-25 30
6 2017-06-28 60
print (df2)
captureDate rule_id
0 2017-06-22 1
3 2017-06-25 4
6 2017-06-28 7
inner = pd.merge(df1, df2, on='captureDate', how='inner')
print (inner)
captureDate rule_id_x rule_id_y
0 2017-06-22 40 1
1 2017-06-25 30 4
2 2017-06-28 60 7
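For the aggregation alternative mentioned above, one possible sketch applied to the original (pre-drop_duplicates) frames; the choice of mean as the aggregation is an assumption, use whatever summary fits your data:
df1_agg = df1.groupby('captureDate', as_index=False)['rule_id'].mean()
df2_agg = df2.groupby('captureDate', as_index=False)['rule_id'].mean()
inner = pd.merge(df1_agg, df2_agg, on='captureDate', how='inner')
print (inner)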
EDIT1:
You can use cumcount to count duplicates by column captureDate and then merge. Last, remove the helper column new by drop:
df1 = pd.DataFrame({'captureDate':['2017-06-22']* 3 + ['2017-06-25']* 3 + ['2017-06-28'] * 2,
'rule_id':[40,10,20,30,70,10,60,10]})
df2 = pd.DataFrame({'captureDate':['2017-06-22'] * 3 + ['2017-06-25'] * 3,
'rule_id':[1,2,3,4,5,6]})
df1['new'] = df1.groupby('captureDate').cumcount()
df2['new'] = df2.groupby('captureDate').cumcount()
print (df1)
captureDate rule_id new
0 2017-06-22 40 0
1 2017-06-22 10 1
2 2017-06-22 20 2
3 2017-06-25 30 0
4 2017-06-25 70 1
5 2017-06-25 10 2
6 2017-06-28 60 0
7 2017-06-28 10 1
print (df2)
captureDate rule_id new
0 2017-06-22 1 0
1 2017-06-22 2 1
2 2017-06-22 3 2
3 2017-06-25 4 0
4 2017-06-25 5 1
5 2017-06-25 6 2
df3 = pd.merge(df1, df2, on=['captureDate','new']).drop('new', axis=1)
print (df3)
captureDate rule_id_x rule_id_y
0 2017-06-22 40 1
1 2017-06-22 10 2
2 2017-06-22 20 3
3 2017-06-25 30 4
4 2017-06-25 70 5
5 2017-06-25 10 6
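Note that merge defaults to an inner join, so the two 2017-06-28 rows of df1, which have no counterpart in df2, are dropped from df3. If you want to keep them with NaN on the right side, a left join is one option (a variation on the code above, not part of the original answer):
df4 = pd.merge(df1, df2, on=['captureDate','new'], how='left').drop('new', axis=1)
print (df4)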
Upvotes: 2