Reputation: 3084
I would like to remove all sessions after a user converts (and also remove the sessions that happened on the day of conversion).
full_sessions = pd.DataFrame(data={'user_id':[1,1,2,3,3], 'visit_no':[1,2,1,1,2], 'date':['20180307','20180308','20180307','20180308','20180308'], 'result':[0,1,1,0,0]})
print(full_sessions)
date result user_id visit_no
0 20180307 0 1 1
1 20180308 1 1 2
2 20180307 1 2 1
3 20180308 0 3 1
4 20180308 0 3 2
When did people convert?
conversion = full_sessions[full_sessions['result'] == 1][['user_id','date']]
print(conversion)
user_id date
1 1 20180308
2 2 20180307
Ideal output:
date result user_id visit_no
0 20180307 0 1 1
3 20180308 0 3 1
4 20180308 0 3 2
What I want, expressed in SQL, would be:
SELECT * FROM (
SELECT full_sessions.* FROM full_sessions
JOIN conversion
ON full_sessions.user_id = conversion.user_id AND full_sessions.date < conversion.date
UNION ALL
SELECT * FROM full_sessions
WHERE user_id NOT IN (SELECT user_id FROM conversion)
) AS t
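For reference, a minimal pandas sketch of the same two-branch logic (pre-conversion sessions of converters, plus every session of users who never converted); the pre / never names are my own, and it assumes the full_sessions and conversion frames above with YYYYMMDD string dates, which compare correctly as strings:
import pandas as pd

# converters: keep only sessions strictly before the conversion date
pre = full_sessions.merge(conversion, on='user_id', suffixes=('', '_conv'))
pre = pre[pre['date'] < pre['date_conv']].drop(columns='date_conv')

# users who never converted: keep all of their sessions
never = full_sessions[~full_sessions['user_id'].isin(conversion['user_id'])]

print(pd.concat([pre, never]))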
Upvotes: 0
Views: 129
Reputation: 28303
Using groupby, apply, and some final cleanup with reset_index, you can express it in one (very long) statement:
full_sessions.groupby('user_id', as_index=False).apply(
lambda x: x[:(x.result==1).values.argmax()] if any(x.result==1) else x
).reset_index(level=0, drop=True)
outputs:
date result user_id visit_no
0 20180307 0 1 1
3 20180308 0 3 1
4 20180308 0 3 2
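To make the slicing explicit, here is a small sketch (not part of the answer) of what the lambda does for a single group, user 1:
grp = full_sessions[full_sessions['user_id'] == 1]
first_conversion = (grp.result == 1).values.argmax()  # positional index of the first result == 1
print(grp[:first_conversion])                         # keeps only the rows strictly before that position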
Upvotes: 1
Reputation: 164753
Here is a method which maps a series, as an alternative to join / merge.
fs = full_sessions.copy()  # 'fs' is the frame from the question
fs['date'] = pd.to_numeric(fs['date'])
s = fs[fs['result'] == 1].set_index('user_id')['date']
result = fs.loc[fs['date'] < fs['user_id'].map(s).fillna(fs['date'].max()+1)]
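As an intermediate check (my own sketch, reusing fs and s from above): s maps each converting user_id to its numeric conversion date, and fs['user_id'].map(s) lines that date up with every session row, leaving NaN for users who never converted:
print(s)                     # user_id -> conversion date, converters only
print(fs['user_id'].map(s))  # one conversion date per session row; NaN for non-converters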
Result
date result user_id visit_no
0 20180307 0 1 1
3 20180308 0 3 1
4 20180308 0 3 2
Explanation
- Map user_id to conversion date and store it in a series s.
- Use pd.Series.map on user_id to attach each user's conversion date to every session row.
- .fillna the mapped dates for users who never converted with a maximal date, so all of their sessions are kept. datetime objects work too; I have converted to numeric above for simplicity.
Upvotes: 1
Reputation: 2598
You can join the dataframes and then filter the rows that match your criteria:
df_join = full_sessions.join(conversion,lsuffix='',
rsuffix='_right',how='left',on='user_id')
print(df_join)
date result user_id visit_no user_id_right date_right
0 20180307 0 1 1 1.0 20180308
1 20180308 1 1 2 1.0 20180308
2 20180307 1 2 1 2.0 20180307
3 20180308 0 3 1 NaN NaN
4 20180308 0 3 2 NaN NaN
And then just keep the rows where date_right is NaN or where date is smaller than date_right:
>>> df_join[df_join.apply(lambda x: x.date < x.date_right
                          if pd.notna(x.date_right)
                          else True, axis=1)][['date','visit_no','user_id']]
date visit_no user_id
0 20180307 1 1
3 20180308 1 3
4 20180308 2 3
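The same filter can also be written without apply as a vectorized mask; in this sketch (not part of the original answer) the missing conversion dates are filled with a sentinel string that sorts after every real YYYYMMDD date:
sentinel = '99999999'  # my own placeholder; sorts after any real YYYYMMDD string
mask = df_join['date'] < df_join['date_right'].fillna(sentinel)
print(df_join[mask][['date','visit_no','user_id']])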
Upvotes: 1
Reputation: 323316
IIUC, using merge in pandas:
full_sessions.merge(conversion,on='user_id',how='left').loc[lambda x : (x.date_y>x.date_x)|(x.date_y.isnull())].dropna(axis=1)
Out[397]:
date_x result user_id visit_no
0 20180307 0 1 1
3 20180308 0 3 1
4 20180308 0 3 2
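The same one-liner, unpacked into steps for readability (a sketch; the merged and keep names are mine, the behavior is unchanged):
merged = full_sessions.merge(conversion, on='user_id', how='left')  # overlapping 'date' becomes date_x / date_y
keep = (merged.date_y > merged.date_x) | merged.date_y.isnull()     # strictly before conversion, or never converted
print(merged.loc[keep].dropna(axis=1))                              # dropna(axis=1) removes date_y, the only column with NaN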
Upvotes: 1