Winds
Winds

Reputation: 397

how to concat 2 pandas dataframe based on time ranges

I have two dataframe and I would like to concat them based on time ranges for example

dataframe A

user   timestamp    product
 A      2015/3/13      1
 B      2015/3/15      2

dataframe B

 user      time     behavior
   A     2015/3/1      2
   A     2015/3/8      3
   A     2015/3/13     1
   B     2015/3/1      2

I would like to concat 2 dataframe as below ( frame B left join to frame A) column "timestamp1" is 7 days before column "timestamp" for example when timestamp is 3/13 , then 3/6-13 is in the range otherwise dont concat

user   timestamp    product        time1       behavior 
 A      2015/3/13      1          2015/3/8        3
 A      2015/3/13      1          2015/3/13       1
 B      2015/3/15      2            NaN          NaN

the sql code would look like

select * from 
B left join  A
on user
where B.time >= A.timestamp - 7  &  B.time <= A.timestamp 

##WHERE B.time BETWEEN DATE_SUB(B.time, INTERVAL 7 DAY) AND A.timestamp ;

how can we make this on python?

can only think of the following and dont know how to work with the time..

new = pd.merge(A, B, on='user', how='left')

thanks and sorry..

Upvotes: 2

Views: 2468

Answers (2)

andrew_reece
andrew_reece

Reputation: 21274

Here's one solution that relies on two merges - first, to narrow down dataframe B (df2), and then to produce the desired output:

We can read in the example dataframes with read_clipboard():

import pandas as pd

# copy dataframe A data, then:
df1 = pd.read_clipboard(parse_dates=['timestamp'])

# copy dataframe B data, then:
df2 = pd.read_clipboard(parse_dates=['time'])

Merge and filter:

# left merge df2, df1
tmp = df2.merge(df1, on='user', how='left')

# then drop rows which disqualify based on timestamp
mask = tmp.time < (tmp.timestamp - pd.Timedelta(days=7))
tmp.loc[mask, ['time', 'behavior']] = None
tmp = tmp.dropna(subset=['time']).drop(['timestamp','product'], axis=1)

# now perform final merge
merged = df1.merge(tmp, on='user', how='left')

Output:

  user  timestamp  product       time  behavior
0    A 2015-03-13        1 2015-03-08       3.0
1    A 2015-03-13        1 2015-03-13       1.0
2    B 2015-03-15        2        NaT       NaN

Upvotes: 0

Satyadev
Satyadev

Reputation: 643

The few steps required to solve this-

from datetime import timedelta 

First,convert your timestamps to pandas datetime. (df1 refers to Dataframe A and df2 refers to Dataframe B)

df1[['time']]=df1[['timestamp']].apply(pd.to_datetime)
df2[['time']]=df2[['time']].apply(pd.to_datetime)

Merge as follows: (Based on your final dataset i think your left join is more of a right join)

df3 = pd.merge(df1,df2,how='left') 

Get your final df:

df4 = df3[(df3.time>=df3.timestamp-timedelta(days=7)) & (df3.time<=df3.timestamp)]

The row containing nan is missing and this is because of the way conditional joins are done in pandas.

Condtional joins are not a feature of pandas yet. A way to get past that is by doing filtering post a join.

Upvotes: 2

Related Questions