Reputation: 397
I have two dataframe and I would like to concat them based on time ranges for example
dataframe A
user timestamp product
A 2015/3/13 1
B 2015/3/15 2
dataframe B
user time behavior
A 2015/3/1 2
A 2015/3/8 3
A 2015/3/13 1
B 2015/3/1 2
I would like to concat 2 dataframe as below ( frame B left join to frame A) column "timestamp1" is 7 days before column "timestamp" for example when timestamp is 3/13 , then 3/6-13 is in the range otherwise dont concat
user timestamp product time1 behavior
A 2015/3/13 1 2015/3/8 3
A 2015/3/13 1 2015/3/13 1
B 2015/3/15 2 NaN NaN
the sql code would look like
select * from
B left join A
on user
where B.time >= A.timestamp - 7 & B.time <= A.timestamp
##WHERE B.time BETWEEN DATE_SUB(B.time, INTERVAL 7 DAY) AND A.timestamp ;
how can we make this on python?
can only think of the following and dont know how to work with the time..
new = pd.merge(A, B, on='user', how='left')
thanks and sorry..
Upvotes: 2
Views: 2468
Reputation: 21274
Here's one solution that relies on two merges - first, to narrow down dataframe B
(df2
), and then to produce the desired output:
We can read in the example dataframes with read_clipboard()
:
import pandas as pd
# copy dataframe A data, then:
df1 = pd.read_clipboard(parse_dates=['timestamp'])
# copy dataframe B data, then:
df2 = pd.read_clipboard(parse_dates=['time'])
Merge and filter:
# left merge df2, df1
tmp = df2.merge(df1, on='user', how='left')
# then drop rows which disqualify based on timestamp
mask = tmp.time < (tmp.timestamp - pd.Timedelta(days=7))
tmp.loc[mask, ['time', 'behavior']] = None
tmp = tmp.dropna(subset=['time']).drop(['timestamp','product'], axis=1)
# now perform final merge
merged = df1.merge(tmp, on='user', how='left')
Output:
user timestamp product time behavior
0 A 2015-03-13 1 2015-03-08 3.0
1 A 2015-03-13 1 2015-03-13 1.0
2 B 2015-03-15 2 NaT NaN
Upvotes: 0
Reputation: 643
The few steps required to solve this-
from datetime import timedelta
First,convert your timestamps to pandas datetime. (df1 refers to Dataframe A and df2 refers to Dataframe B)
df1[['time']]=df1[['timestamp']].apply(pd.to_datetime)
df2[['time']]=df2[['time']].apply(pd.to_datetime)
Merge as follows: (Based on your final dataset i think your left join is more of a right join)
df3 = pd.merge(df1,df2,how='left')
Get your final df:
df4 = df3[(df3.time>=df3.timestamp-timedelta(days=7)) & (df3.time<=df3.timestamp)]
The row containing nan is missing and this is because of the way conditional joins are done in pandas.
Condtional joins are not a feature of pandas yet. A way to get past that is by doing filtering post a join.
Upvotes: 2