user636322
user636322

Reputation: 1201

How to merge two pandas DataFrames in Python?

I am trying to join two pandas data frames with an inner join.

my_df = pd.merge(df1, df2, how = 'inner', left_on = ['date'], right_on = ['myDate'])

However I am getting the following error:

KeyError: 'myDate' TypeError: an integer is required

I believe joining on dates is valid, however I cannot make this simple join work?

DF2 was created using the following

df2 = idf.groupby(lambda x: (x.year,x.month,x.day)).mean()

Can someone please advise? Thanks a lot.

df1
type    object
id      object
date    object
value   float64 

    type    id          date       value
0   CAR     PSTAT001    15/07/15    42
1   BIKE    PSTAT001    16/07/15    42
2   BIKE    PSTAT001    17/07/15    42
3   BIKE    PSTAT004    18/07/15    42
4   BIKE    PSTAT001    19/07/15    32

df2 
myDate  object
val1    float64
val2    float64
val3    float64

    myDate     val1         val2           val3
0   (2015,7,13) 1074        1871.666667    2800.777778
1   (2015,7,14) 347.958333  809.416667     1308.458333
2   (2015,7,15) 202.625     597.375        1008.666667
3   (2015,7,16) 494.958333  1192           1886.916667

DF1.info()

<class  'pandas.core.frame.DataFrame'>              
Int64Index: 3040    entries,    0   to  3039
Data    columns (total  4   columns):   
type    3040    non-null    object      
id      3040    non-null    object      
date    3040    non-null    object      
value   3040    non-null    float64     
dtypes: float64(1), object(3)           
memory  usage:  118.8+  KB  

DF2.info()

<class  'pandas.core.frame.DataFrame'>              
Int64Index: 16  entries,    0   to  15
Data    columns (total  4   columns):   
myDate  16  non-null    object      
val1    16  non-null    float64     
val2    16  non-null    float64     
val3    16  non-null    float64     
dtypes: float64(3), object(1)           
memory  usage:  640.0+  bytes   

Upvotes: 2

Views: 222

Answers (2)

AZhao
AZhao

Reputation: 14405

As the comments say, the lack of matching is coming from differing data formats. You have df1's 'date' field as a an object, but df2's 'myDate' as a object represented as a tuple.

First let's convert df1 'date' into datetime, as @EdChum suggests.

df1 = pd.DataFrame(data = np.array([['CAR', 'PSTAT001', '15/07/15', 42]]), \
    columns = ['type','id','date','value'])

df1['date']=pd.to_datetime(df1['date'])

Then, again as @EdChum suggests, we convert the tuple into the string using the datetime library.

df2 = pd.DataFrame(data = np.array([[(2015,7,15) ,202.625 ,597.375,1008.666667]]), \
    columns = ['myDate','val1','val2','val3'])

df2['myDate'] = df2['myDate'].apply(lambda x: datetime(x[0], x[1], x[2]))

And from there our merge works. I used only row 0 of df1 and row3 to make things simpler in my ide.

my_df=  pd.merge(df1, df2, how = 'inner', left_on = ['date'], right_on = ['myDate'])

my_df[:1]
Out[21]: 
  type        id       date value     myDate     val1     val2      val3
0  CAR  PSTAT001 2015-07-15    42 2015-07-15  202.625  597.375  1008.667

Upvotes: 2

EdChum
EdChum

Reputation: 393893

Your date columns are not datetime dtype, df1 looks like a str whilst the other is a tuple so you need to convert these first and then the merge will work:

In [75]:
df1['date'] = pd.to_datetime(df1['date'])
df1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5 entries, 0 to 4
Data columns (total 4 columns):
type     5 non-null object
id       5 non-null object
date     5 non-null datetime64[ns]
value    5 non-null int64
dtypes: datetime64[ns](1), int64(1), object(2)
memory usage: 200.0+ bytes

In [76]:
import datetime as dt
df2['myDate'] = df2['myDate'].apply(lambda x: dt.datetime(x[0], x[1], x[2]))
df2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4 entries, 0 to 3
Data columns (total 4 columns):
myDate    4 non-null datetime64[ns]
val1      4 non-null float64
val2      4 non-null float64
val3      4 non-null float64
dtypes: datetime64[ns](1), float64(3)
memory usage: 160.0 bytes

In [78]:    
my_df=  pd.merge(df1, df2, how = 'inner', left_on = ['date'], right_on = ['myDate'])
my_df

Out[78]:
   type        id       date  value     myDate        val1      val2  \
0   CAR  PSTAT001 2015-07-15     42 2015-07-15  202.625000   597.375   
1  BIKE  PSTAT001 2015-07-16     42 2015-07-16  494.958333  1192.000   

          val3  
0  1008.666667  
1  1886.916667 

Upvotes: 2

Related Questions