Reputation: 1727
I'm manipulating observational behavioral data, in which each row entry is a behavior, with a timestamp. The dataset includes multiple observational periods, and the behaviors were recorded in real time, with timestamps using Epoch time notation. I want to create a new column in my dataframe that labels chunks of time within each observational period. Specifically, I want to label row entries made in 5 minute blocks from the start of each observational period. (I need to use these to calculate the frequencies of different behaviors over a fixed time period). The data look like this (there are more columns involved, but I've left them out for ease of presentation):
observation epoch behavior
1 12973561838 vo
1 12973561870 bc
1 12973561944 xp
1 12973562055 vo
1 12973562106 vo
2 12731709280 wc
2 12731709322 we
2 12731709361 vb
2 12731709374 vw
...
What I would like the product to look like is the following:
observation epoch behavior timeblock
1 12973561838 vo 1
1 12973561870 bc 1
1 12973561944 xp 2
1 12973562055 vo 3
1 12973562106 vo 3
2 12731709280 wc 4
2 12731709322 we 5
2 12731709361 vb 5
2 12731709374 vw 5
...
Where "timeblock" #1 will include the first 4:59 minutes of observation period #1, #2 will include 5:00 to 9:59 minutes...through to 25:00 and over, for each observation period. In this example, #4 will include the first 4:59 minutes of observation period #2, and so on. (I realize the Epoch time values don't match up with my time blocks here, but that's because I've abbreviated the data and just grabbed some timestamps haphazardly. I don't think this will hurt comprehension).
What I've tried so far: I have messed around with groupby, and calculated actual epoch values at the start and end of 5 minute time blocks for each of the observation periods, in a separate dataframe. But I can't see how to bring these to bear in a function that applies per observation period in the original dataframe above, where there are multiple values for each observational period. I suspect the answer lies in a better comprehension of groupby and the apply method, but I'm having trouble getting this off the ground. (Also, maybe I am not using the right search terms, but I'm not finding much on this posted to the forum already. I'm only able to find info on working with timeseries). Two options I've considered but can't figure out how to program:
Calculate actual Epoch time points for each observational period at which I could split the period into 5 minute time blocks based on the minimum Epoch value for each observational period (this part is done), and then use this to generate a column of time block labels (this part has me stuck).
Calculate time-since-start for each row entry within an observational period, based on the minimum Epoch value for that observational period (stuck here as well). Then, instead of using a different list of values for each observational period (as in conceptual solution #1), use a standard range of values (minutes 0 to 4:59, 5 to 9:59, etc.) to create column of time block labels. I'm stuck on how to even start this one.
Your help is much appreciated!
updated clarification, below
I've already used groupby to create a table of the initial timestamps for each 'observation' period:
g_follow = teach_freq['Epoch'].groupby(teach_freq['observation'])
start_follow = g_follow.min()
I've put this into a data frame along with the max (which marks the end timestamp for each 'observation' period). That produces the following dataframe, with 225 'observation' periods:
observation min max
1 12973561838 12973563058
2 12973571418 12973572718
3 12973825256 12973826474
...
Note that each 'observation' period has a different 'min' value, or different start time. Option #1 above would mean I need to write code to subtract the 'min' code for each observation from every 'Epoch' entry in my larger database.
UPDATE: Based on Dmitry's suggested code, I'm trying to use the following:
#where the dataframe with timeblocks & start times is named blocks
#each observation period is in column 'follow'
#and each start time for the observation periods is in column 'first'
min_time = lambda row: row['Epoch'] - blocks[blocks['follow'] == row['follow']]['first']
Followed by:
#where the dataframe with observed & timestamped behaviors is named teach_freq
teach_freq['std_epoch'] = teach_freq.apply(min_time, axis=1)
But when I run this I get the following error:
ValueError: Wrong number of items passed 1, indices imply 225
Upvotes: 2
Views: 2132
Reputation: 163
I encountered this problem before - this is my solution.
Observation starts from some timestamp. We can subtract the initial timestamp from each row, so as a result we would have all epochs starting from time 0.
timeseries['timeblock'] = timeseries['epoch'] - timeseries['epoch'][0]
This 'normalized' time field we can map to the 5-minute interval:
timeseries['timeblock'] = timeseries['timeblock'].map(lambda x: int(x/300))
I started using pandas not so long ago, so probably there exists more pandas-like solution
[Edit] You start your timeblocks with 1, so the right code would be
timeseries['timeblock'] = timeseries['timeblock'].map(lambda x: int(x/300)+1)
[Edit] Update - I think you can use the apply function to subtract the right min time
# let's say that data frame with observations and their mins is called omf
min_time = lambda row: row['epoch'] - omf[omf['observation'] == row['observation']]['min']
timeseries['new_epoch'] = timeseries.apply(min_time,axis=1)
[Edit] Update Full code - using your notation and series:
# Notice epoch lowercase, panda column names are case sensitive
g_follow = teach_freq['epoch'].groupby(teach_freq['observation'])
start_follow = g_follow.min()
# Important - start_follow is a Series where observation is an index
blocks = start_follow # to have the same notation
# main part - using Series instead of DataFrame makes the indexing simpler
min_time = lambda row: row['epoch'] - blocks[row['observation']]
teach_freq['std_epoch'] = teach_freq.apply(min_time, axis=1)
Upvotes: 3