GRS

Reputation: 3084

Pandas DataFrame joining 2 tables on <,> conditions

I would like to remove all sessions that occurred after a user converted, as well as the sessions on the day of conversion itself.

full_sessions = pd.DataFrame(data={'user_id':[1,1,2,3,3], 'visit_no':[1,2,1,1,2], 'date':['20180307','20180308','20180307','20180308','20180308'], 'result':[0,1,1,0,0]})
print(full_sessions)

       date  result  user_id  visit_no
0  20180307       0        1         1
1  20180308       1        1         2
2  20180307       1        2         1
3  20180308       0        3         1
4  20180308       0        3         2

When did people convert?

conversion = full_sessions[full_sessions['result'] == 1][['user_id','date']]
print(conversion)
   user_id      date
1        1  20180308
2        2  20180307

Ideal output:

       date  result  user_id  visit_no
0  20180307       0        1         1
3  20180308       0        3         1
4  20180308       0        3         2

What do I want in SQL?

The equivalent SQL would be:

SELECT * FROM (
    SELECT full_sessions.* FROM full_sessions
    JOIN conversion
      ON full_sessions.user_id = conversion.user_id
     AND full_sessions.date < conversion.date
    UNION ALL
    SELECT * FROM full_sessions
    WHERE user_id NOT IN (SELECT user_id FROM conversion)
) t
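That SQL translates fairly directly to pandas. A minimal sketch of the two branches (the join part plus the NOT IN part), assuming the dates stay as zero-padded YYYYMMDD strings so lexicographic comparison is safe:

```python
import pandas as pd

full_sessions = pd.DataFrame({'user_id': [1, 1, 2, 3, 3],
                              'visit_no': [1, 2, 1, 1, 2],
                              'date': ['20180307', '20180308', '20180307', '20180308', '20180308'],
                              'result': [0, 1, 1, 0, 0]})
conversion = full_sessions[full_sessions['result'] == 1][['user_id', 'date']]

# branch 1: converters' sessions strictly before their conversion date
before = full_sessions.merge(conversion, on='user_id', suffixes=('', '_conv'))
before = before[before['date'] < before['date_conv']].drop(columns='date_conv')

# branch 2: sessions of users who never converted (the NOT IN branch)
never = full_sessions[~full_sessions['user_id'].isin(conversion['user_id'])]

out = pd.concat([before, never])
```

The merge in branch 1 is an inner join, so non-converters drop out of it and are added back, exactly once, by branch 2.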

Upvotes: 0

Views: 129

Answers (4)

Haleemur Ali

Reputation: 28303

Using groupby & apply & some final cleanup with reset_index, you can express this in one (rather long) statement:

full_sessions.groupby('user_id', as_index=False).apply(
    lambda x: x[:(x.result==1).values.argmax()] if any(x.result==1) else x
).reset_index(level=0, drop=True)

outputs:

       date  result  user_id  visit_no
0  20180307       0        1         1
3  20180308       0        3         1
4  20180308       0        3         2
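A vectorized alternative to the groupby-apply (my sketch, assuming `result` only takes values 0/1 and rows are already ordered by date within each user): rows before a user's first conversion have a running `result` total of 0, while the converting row and everything after have a total of at least 1.

```python
import pandas as pd

full_sessions = pd.DataFrame({'user_id': [1, 1, 2, 3, 3],
                              'visit_no': [1, 2, 1, 1, 2],
                              'date': ['20180307', '20180308', '20180307', '20180308', '20180308'],
                              'result': [0, 1, 1, 0, 0]})

# keep rows whose per-user cumulative sum of `result` is still 0,
# i.e. everything strictly before that user's converting session
kept = full_sessions[full_sessions.groupby('user_id')['result'].cumsum().eq(0)]
```

This avoids the Python-level lambda, which can matter on large frames.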

Upvotes: 1

jpp

Reputation: 164753

Here is a method which maps a series, as an alternative to join / merge.

fs = full_sessions.copy()  # fs is shorthand for full_sessions
fs['date'] = pd.to_numeric(fs['date'])

s = fs[fs['result'] == 1].set_index('user_id')['date']

result = fs.loc[fs['date'] < fs['user_id'].map(s).fillna(fs['date'].max()+1)]

Result

       date  result  user_id  visit_no
0  20180307       0        1         1
3  20180308       0        3         1
4  20180308       0        3         2

Explanation

  • Create a mapping from user_id to conversion date, store it in a series s.
  • Then just filter on dates prior to conversion dates mapped via user_id.
  • If a user has no conversion date, all their rows are included, since we fillna with a value beyond the maximal date.
  • Consider using datetime objects. I have converted to numeric above for simplicity.
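Following the last bullet, the same approach with real datetimes rather than integers might look like this sketch, using `pd.Timestamp.max` as the "never converted" sentinel:

```python
import pandas as pd

fs = pd.DataFrame({'user_id': [1, 1, 2, 3, 3],
                   'visit_no': [1, 2, 1, 1, 2],
                   'date': ['20180307', '20180308', '20180307', '20180308', '20180308'],
                   'result': [0, 1, 1, 0, 0]})
fs['date'] = pd.to_datetime(fs['date'], format='%Y%m%d')

# map each user to their conversion date; non-converters get a far-future sentinel
s = fs[fs['result'] == 1].set_index('user_id')['date']
cutoff = fs['user_id'].map(s).fillna(pd.Timestamp.max)

result = fs.loc[fs['date'] < cutoff]
```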

Upvotes: 1

Mabel Villalba

Reputation: 2598

You can join the dataframes and then filter the rows matching your criteria this way:

df_join = full_sessions.join(conversion,lsuffix='',
                            rsuffix='_right',how='left',on='user_id')
print(df_join)

      date  result  user_id  visit_no  user_id_right date_right
0  20180307       0        1         1            1.0   20180308
1  20180308       1        1         2            1.0   20180308
2  20180307       1        2         1            2.0   20180307
3  20180308       0        3         1            NaN        NaN
4  20180308       0        3         2            NaN        NaN

And then just keep the rows where date_right is NaN or where date is smaller than date_right:

>>> df_join[df_join.apply(lambda x: x.date < x.date_right
                          if not pd.isna(x.date_right)
                          else True, axis=1)][['date','visit_no','user_id']]

        date  visit_no  user_id
0  20180307         1        1
3  20180308         1        3
4  20180308         2        3
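The row-wise apply can also be written as a vectorized boolean mask. A sketch that first coerces both date columns to numbers, so the NaN rows of non-converters compare safely:

```python
import pandas as pd

full_sessions = pd.DataFrame({'user_id': [1, 1, 2, 3, 3],
                              'visit_no': [1, 2, 1, 1, 2],
                              'date': ['20180307', '20180308', '20180307', '20180308', '20180308'],
                              'result': [0, 1, 1, 0, 0]})
conversion = full_sessions[full_sessions['result'] == 1][['user_id', 'date']]
df_join = full_sessions.join(conversion, lsuffix='', rsuffix='_right',
                             how='left', on='user_id')

d = pd.to_numeric(df_join['date'])
d_right = pd.to_numeric(df_join['date_right'])  # NaN for users who never converted

# keep non-converters (NaN on the right) and sessions before the conversion date
mask = d_right.isna() | (d < d_right)
out = df_join[mask][['date', 'visit_no', 'user_id']]
```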

Upvotes: 1

BENY

Reputation: 323316

IIUC, using merge in pandas:

full_sessions.merge(conversion, on='user_id', how='left').loc[lambda x: (x.date_y > x.date_x) | x.date_y.isnull()].dropna(axis=1)
Out[397]: 
     date_x  result  user_id  visit_no
0  20180307       0        1         1
3  20180308       0        3         1
4  20180308       0        3         2
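Note that `dropna(axis=1)` drops every column containing a NaN, which here happens to remove only the helper `date_y` column. Dropping it by name is more explicit; a sketch of the same pipeline written that way:

```python
import pandas as pd

full_sessions = pd.DataFrame({'user_id': [1, 1, 2, 3, 3],
                              'visit_no': [1, 2, 1, 1, 2],
                              'date': ['20180307', '20180308', '20180307', '20180308', '20180308'],
                              'result': [0, 1, 1, 0, 0]})
conversion = full_sessions[full_sessions['result'] == 1][['user_id', 'date']]

out = (full_sessions.merge(conversion, on='user_id', how='left')
       .loc[lambda x: (x.date_y > x.date_x) | x.date_y.isnull()]
       .drop(columns='date_y')          # drop the conversion-date helper by name
       .rename(columns={'date_x': 'date'}))
```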

Upvotes: 1
