Reputation: 1201
I am trying to join two pandas data frames with an inner join.
my_df = pd.merge(df1, df2, how = 'inner', left_on = ['date'], right_on = ['myDate'])
However I am getting the following error:
KeyError: 'myDate' TypeError: an integer is required
I believe joining on dates is valid, however I cannot make this simple join work?
DF2 was created using the following
df2 = idf.groupby(lambda x: (x.year,x.month,x.day)).mean()
Can someone please advise? Thanks a lot.
df1
type object
id object
date object
value float64
type id date value
0 CAR PSTAT001 15/07/15 42
1 BIKE PSTAT001 16/07/15 42
2 BIKE PSTAT001 17/07/15 42
3 BIKE PSTAT004 18/07/15 42
4 BIKE PSTAT001 19/07/15 32
df2
myDate object
val1 float64
val2 float64
val3 float64
myDate val1 val2 val3
0 (2015,7,13) 1074 1871.666667 2800.777778
1 (2015,7,14) 347.958333 809.416667 1308.458333
2 (2015,7,15) 202.625 597.375 1008.666667
3 (2015,7,16) 494.958333 1192 1886.916667
DF1.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3040 entries, 0 to 3039
Data columns (total 4 columns):
type 3040 non-null object
id 3040 non-null object
date 3040 non-null object
value 3040 non-null float64
dtypes: float64(1), object(3)
memory usage: 118.8+ KB
DF2.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 16 entries, 0 to 15
Data columns (total 4 columns):
myDate 16 non-null object
val1 16 non-null float64
val2 16 non-null float64
val3 16 non-null float64
dtypes: float64(3), object(1)
memory usage: 640.0+ bytes
Upvotes: 2
Views: 222
Reputation: 14405
As the comments say, the lack of matching is coming from differing data formats. You have df1's 'date' field as a an object, but df2's 'myDate' as a object represented as a tuple.
First let's convert df1 'date' into datetime, as @EdChum suggests.
df1 = pd.DataFrame(data = np.array([['CAR', 'PSTAT001', '15/07/15', 42]]), \
columns = ['type','id','date','value'])
df1['date']=pd.to_datetime(df1['date'])
Then, again as @EdChum suggests, we convert the tuple into the string using the datetime library.
df2 = pd.DataFrame(data = np.array([[(2015,7,15) ,202.625 ,597.375,1008.666667]]), \
columns = ['myDate','val1','val2','val3'])
df2['myDate'] = df2['myDate'].apply(lambda x: datetime(x[0], x[1], x[2]))
And from there our merge works. I used only row 0 of df1 and row3 to make things simpler in my ide.
my_df= pd.merge(df1, df2, how = 'inner', left_on = ['date'], right_on = ['myDate'])
my_df[:1]
Out[21]:
type id date value myDate val1 val2 val3
0 CAR PSTAT001 2015-07-15 42 2015-07-15 202.625 597.375 1008.667
Upvotes: 2
Reputation: 393893
Your date columns are not datetime dtype, df1
looks like a str
whilst the other is a tuple
so you need to convert these first and then the merge will work:
In [75]:
df1['date'] = pd.to_datetime(df1['date'])
df1.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5 entries, 0 to 4
Data columns (total 4 columns):
type 5 non-null object
id 5 non-null object
date 5 non-null datetime64[ns]
value 5 non-null int64
dtypes: datetime64[ns](1), int64(1), object(2)
memory usage: 200.0+ bytes
In [76]:
import datetime as dt
df2['myDate'] = df2['myDate'].apply(lambda x: dt.datetime(x[0], x[1], x[2]))
df2.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4 entries, 0 to 3
Data columns (total 4 columns):
myDate 4 non-null datetime64[ns]
val1 4 non-null float64
val2 4 non-null float64
val3 4 non-null float64
dtypes: datetime64[ns](1), float64(3)
memory usage: 160.0 bytes
In [78]:
my_df= pd.merge(df1, df2, how = 'inner', left_on = ['date'], right_on = ['myDate'])
my_df
Out[78]:
type id date value myDate val1 val2 \
0 CAR PSTAT001 2015-07-15 42 2015-07-15 202.625000 597.375
1 BIKE PSTAT001 2015-07-16 42 2015-07-16 494.958333 1192.000
val3
0 1008.666667
1 1886.916667
Upvotes: 2