I have a DataFrame like this:
           Name       first_seen        last_seen
0  Random guy 1  5/22/2016 18:12  5/22/2016 18:15
1  Random guy 2  5/22/2016 12:03  5/22/2016 12:03
2  Random guy 3  5/22/2016 21:06  5/22/2016 21:06
3  Random guy 4  5/22/2016 16:20  5/22/2016 16:20
4  Random guy 5  5/22/2016 14:46  5/22/2016 14:46
Now I have to add a column named Visit_period that takes one of four values [morning, afternoon, evening, night], depending on which slot the person (row) spent the most time in:
- morning: 08:00 to 12:00 hrs
- afternoon: 12:00 to 16:00 hrs
- evening: 16:00 to 20:00 hrs
- night: 20:00 to 24:00 hrs
So for the five rows above, the output would be something like this:
visit_period
evening
afternoon
night
evening
afternoon
I mention maximum time spent because a person's first_seen may be at 14:30 and last_seen at 16:21. In that case I would like to assign the value afternoon, as he spent 90 minutes in the afternoon slab and 21 in the evening slab.
I am using Python 2.7.
Upvotes: 5
Views: 1703
Reputation: 21873
You can do this:
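import numpy as np
import pandas as pd

# Midnight of the day in question, so (t - start).seconds is seconds since 00:00:
start = pd.Timestamp("2016-05-22")
d = ["Morning", "Afternoon", "Evening", "Night"]
# Slot boundaries (08:00, 12:00, 16:00, 20:00, 24:00) in seconds since midnight:
sr = np.arange(8, 25, 4) * 3600

def max_spent(fs, ls):
    # Transform your dates into seconds since midnight:
    fss = (fs - start).seconds
    lss = (ls - start).seconds
    # In which slot would each one fit? side="right" puts 12:00 in the afternoon:
    fs_d = sr.searchsorted(fss, side="right")
    ls_d = sr.searchsorted(lss, side="right")
    # If it's not the same slot for both dates
    # (assumes the visit spans at most two adjacent slots):
    if fs_d != ls_d:
        # Compare the time spent in the first slot (first_seen to the end of
        # that slot) with the time spent in the last slot (its start to
        # last_seen) and keep the bigger one:
        if sr[fs_d] - fss > lss - sr[ls_d - 1]:
            return d[fs_d - 1]
        else:
            return d[ls_d - 1]
    else:
        return d[ls_d - 1]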
Then, you just do:
df["visit_period"] = df.apply(lambda x: max_spent(x["first_seen"], x["last_seen"]), axis=1)
and you get:
df
   Name           first_seen            last_seen visit_period
0  guy1  2016-05-22 18:12:00  2016-05-22 18:15:00      Evening
1  guy2  2016-05-22 12:03:00  2016-05-22 12:03:00    Afternoon
2  guy3  2016-05-22 21:06:00  2016-05-22 21:06:00        Night
3  guy4  2016-05-22 16:20:00  2016-05-22 16:20:00      Evening
4  guy5  2016-05-22 14:46:00  2016-05-22 14:46:00    Afternoon
5  guy6  2016-05-22 14:30:00  2016-05-22 16:21:00    Afternoon
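If a visit can span more than two slots, one option is to measure the overlap with every slot directly. The following is only a sketch of that idea, not part of the answer above: SLOTS and visit_period_by_overlap are hypothetical names, and it assumes both timestamps fall on the same day. Ties go to the earlier slot.

```python
from datetime import datetime, timedelta

# Slot boundaries in hours, taken from the question (hypothetical constant name).
SLOTS = [(8, 12, "Morning"), (12, 16, "Afternoon"),
         (16, 20, "Evening"), (20, 24, "Night")]

def visit_period_by_overlap(first_seen, last_seen):
    """Return the label of the slot that overlaps the visit the longest.

    Assumes both timestamps are on the same day. A zero-length visit is
    assigned to the slot that contains it. Returns None when the visit
    lies entirely outside 08:00-24:00.
    """
    midnight = datetime(first_seen.year, first_seen.month, first_seen.day)
    best_label, best_overlap = None, timedelta(0)
    for start_h, end_h, label in SLOTS:
        slot_start = midnight + timedelta(hours=start_h)
        slot_end = midnight + timedelta(hours=end_h)
        if first_seen == last_seen:
            # Zero-length visit: pick the half-open slot [start, end) containing it.
            if slot_start <= first_seen < slot_end:
                return label
            continue
        # Length of the intersection of [first_seen, last_seen] with the slot;
        # negative when they do not intersect.
        overlap = min(last_seen, slot_end) - max(first_seen, slot_start)
        if overlap > best_overlap:
            best_label, best_overlap = label, overlap
    return best_label
```

Assuming first_seen and last_seen hold datetimes, it could be applied the same way as max_spent: df.apply(lambda x: visit_period_by_overlap(x["first_seen"], x["last_seen"]), axis=1).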
Previous version with pd.cut, which I think is better if you don't need to work out which of the two columns to use:
# Seconds since midnight for each last_seen value:
df["sec"] = df.last_seen.dt.hour * 3600 + df.last_seen.dt.minute * 60 + df.last_seen.dt.second
# Apply cut on this column; right=False puts 12:00 in "Afternoon":
df["visit_period"] = pd.cut(df.sec, np.arange(8, 25, 4) * 3600, labels=d, right=False)
I've done it on last_seen only, but you can make another column with the value corresponding to the maximum time spent and then do this on that column.
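The dates in the question look like strings, so they would need parsing before any of this works. Here is a minimal sketch of the pd.cut route starting from such strings. The format string is an assumption based on the sample rows, and binning on the hour of day is a simplification of the seconds-based version.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ["Random guy 1", "Random guy 2", "Random guy 3"],
    "last_seen": ["5/22/2016 18:15", "5/22/2016 12:03", "5/22/2016 21:06"],
})

# Parse the strings; the format is assumed from the rows shown in the question.
df["last_seen"] = pd.to_datetime(df["last_seen"], format="%m/%d/%Y %H:%M")

labels = ["Morning", "Afternoon", "Evening", "Night"]
# Bin on the hour of day with edges 8, 12, 16, 20, 24. right=False makes each
# bin half-open [start, end), so 12:00 counts as afternoon. Hours before 8
# fall outside every bin and become NaN.
df["visit_period"] = pd.cut(df["last_seen"].dt.hour, bins=np.arange(8, 25, 4),
                            labels=labels, right=False)
```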
HTH
Upvotes: 0
Reputation: 42875
You could use apply with the period function below, which attempts to assign a visit period according to the conditions you outlined:
times = list(range(8, 21, 4))
labels = ['morning', 'afternoon', 'evening', 'night']
periods = dict(zip(times, labels))
which gives:
{8: 'morning', 16: 'evening', 12: 'afternoon', 20: 'night'}
now the function to assign periods:
def period(row):
    visit_start = {'hour': row.first_seen.hour, 'min': row.first_seen.minute}  # get hour, min of visit start
    visit_end = {'hour': row.last_seen.hour, 'min': row.last_seen.minute}  # get hour, min of visit end
    for period_start, label in periods.items():
        period_end = period_start + 4
        if period_start <= visit_start['hour'] < period_end:
            # minutes spent in this period vs. minutes spent after it ends
            in_period = (period_end - visit_start['hour']) * 60 - visit_start['min']
            after_period = (visit_end['hour'] - period_end) * 60 + visit_end['min']
            if period_start <= visit_end['hour'] < period_end or in_period > after_period:
                return label
            else:
                return periods[period_end]  # assign label of following period
and finally .apply():
df['period'] = df.apply(period, axis=1)
to get:
           Name           first_seen            last_seen     period
0  Random guy 1  2016-05-22 18:12:00  2016-05-22 18:15:00    evening
1  Random guy 2  2016-05-22 12:03:00  2016-05-22 12:03:00  afternoon
2  Random guy 3  2016-05-22 21:06:00  2016-05-22 21:06:00      night
3  Random guy 4  2016-05-22 16:20:00  2016-05-22 16:20:00    evening
4  Random guy 5  2016-05-22 14:46:00  2016-05-22 14:46:00  afternoon
Upvotes: 1