Reputation: 1309
A row's source is a visit if, for the row above, (1) the company is the same and (2) the type is home. The dataframe is sorted. But relying only on the previous row means a visit is not classified when other rows sit in between: here, row 1 gets in the way of row 2 being a visit. How could I classify these visits as long as the time difference is within 5 minutes?
   source  datetime  location   type  start  company
0          10:00     london     home  1      apple
1          10:03     unknown                 tesla
2          10:04     France                  apple
3          10:05     Melbourne  home  1      apple
4  visit   10:06     France                  apple
10:04 is within 5 minutes of 10:00, so row 2 should be a visit. It also meets the 2 conditions of a visit.

Expected output:
   source  datetime  location   type  start  company
0          10:00     london     home  1      apple
1          10:03     unknown                 tesla
2  visit   10:04     France                  apple
3          10:05     Melbourne  home  1      apple
4  visit   10:06     France                  apple
Upvotes: 0
Views: 51
Reputation: 11650
Here is one way to do it:

import numpy as np
import pandas as pd

# Reference time: keep datetime only on rows whose type is 'home'
df['ref_date'] = df['datetime'].where(df['type'].str.strip().eq('home'))
# Forward-fill the reference time within each company
df['ref_date'] = df.groupby('company')['ref_date'].ffill()
# Minutes elapsed since the most recent 'home' row of the same company
# (NaN when the company has no earlier 'home' row, so the test below is False)
elapsed = (pd.to_datetime(df['datetime']) - pd.to_datetime(df['ref_date'])).dt.total_seconds() / 60
# Mark a visit when the row is not the 'home' row itself
# and is at most 5 minutes after it
df['source'] = np.where((df['datetime'] != df['ref_date']) & (elapsed <= 5),
                        'visit', df['source'])
df = df.drop(columns='ref_date')
df
   source  datetime  location   type  start  company
0          10:00     london     home  1.0    apple
1          10:03     unknown                 tesla
2  visit   10:04     France                  apple
3          10:05     Melbourne  home  1.0    apple
4  visit   10:06     France                  apple
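As a possible alternative to the forward-fill approach, the same "most recent home row of the same company within 5 minutes" lookup can be sketched with pd.merge_asof. This is only a sketch under assumptions: the sample frame is rebuilt from the question, empty cells are assumed to be empty strings, datetime is assumed to hold "HH:MM" strings, the frame is assumed to be sorted by time, and mark_visits, ts, and home_ts are hypothetical names introduced for illustration.

```python
import pandas as pd

# Sample frame rebuilt from the question (assumption: blanks are empty strings).
df = pd.DataFrame({
    "source":   ["", "", "", "", "visit"],
    "datetime": ["10:00", "10:03", "10:04", "10:05", "10:06"],
    "location": ["london", "unknown", "France", "Melbourne", "France"],
    "type":     ["home", "", "", "home", ""],
    "start":    [1, None, None, 1, None],
    "company":  ["apple", "tesla", "apple", "apple", "apple"],
})

def mark_visits(frame, window="5min"):
    """Hypothetical helper: flag rows that follow a 'home' row of the
    same company by at most `window`. Assumes rows are sorted by time."""
    out = frame.copy()
    # Parse the "HH:MM" strings into timestamps for the as-of join.
    out["ts"] = pd.to_datetime(out["datetime"], format="%H:%M")
    is_home = out["type"].str.strip().eq("home")

    # Right side of the join: only the 'home' rows, keeping their time
    # in a separate column so a successful match can be detected.
    homes = out.loc[is_home, ["company", "ts"]].assign(home_ts=lambda d: d["ts"])

    # For every row, find the latest 'home' row of the same company that
    # is not later than the row and lies within the tolerance window.
    matched = pd.merge_asof(
        out[["company", "ts"]], homes,
        on="ts", by="company",
        direction="backward",
        tolerance=pd.Timedelta(window),
    )

    has_recent_home = matched["home_ts"].notna().to_numpy()
    # A 'home' row matches itself, so exclude home rows explicitly.
    out.loc[has_recent_home & ~is_home.to_numpy(), "source"] = "visit"
    return out.drop(columns="ts")

result = mark_visits(df)
print(result)
```

The tolerance argument does the 5-minute check inside the join, and by="company" keeps rows of other companies (such as row 1 here) from getting in the way. Home rows are excluded afterwards because, with the default exact matching, each one matches itself.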
Upvotes: 1