Reputation: 786
I grouped a dataframe `test_df2` by frequency `'B'` (business day, so each group name is that day's date at 00:00) and am now looping over the groups to calculate timestamp differences and collect them in `grouped_bins`, one entry per day. The data in the original dataframe and in the groups looks like this:
| | timestamp | status | externalId |
|---|---|---|---|
| 0 | 2020-05-11 13:06:05.922 | 1 | 1 |
| 7 | 2020-05-11 13:14:29.759 | 10 | 1 |
| 8 | 2020-05-11 13:16:09.147 | 1 | 2 |
| 16 | 2020-05-11 13:19:08.641 | 10 | 2 |
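For reference, a minimal self-contained version of the sample above, grouped the same way (the `pd.to_datetime` conversion is only needed if the column is still strings):

import pandas as pd

# Rebuild the sample rows shown above
test_df2 = pd.DataFrame(
    {
        "timestamp": pd.to_datetime([
            "2020-05-11 13:06:05.922",
            "2020-05-11 13:14:29.759",
            "2020-05-11 13:16:09.147",
            "2020-05-11 13:19:08.641",
        ]),
        "status": [1, 10, 1, 10],
        "externalId": [1, 1, 2, 2],
    },
    index=[0, 7, 8, 16],
)

# Group by business day; each group name is the day's date at 00:00
grouped = test_df2.groupby(pd.Grouper(key="timestamp", freq="B", origin="start"))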
What I want is to calculate the difference between each row's timestamp, for example between rows 7 and 0, since they have the same `externalId`.
What I did for that purpose is the following.
# Group function. Dataframes are saved in a dict.
def groupDataWithFrequency(self, dataFrameLabel: str, groupKey: str, frequency: str):
    '''Group time series by frequency. Starts at the beginning of the data frame.'''
    print(f"Binning {dataFrameLabel} data with frequency {frequency}")
    if isinstance(groupKey, str):
        return self.dataDict[dataFrameLabel].groupby(pd.Grouper(key=groupKey, freq=frequency, origin="start"))

grouped_jobstates = groupDataWithFrequency("jobStatus", "timestamp", frequency)
After grouping, I loop over each group (to keep track of the day) and try to calculate the difference between the timestamps, which is where it goes wrong.
grouped_bins = []  # one dict per day

def jobStatusPRAggregator(data, name):
    if data["status"] == 1:
        # Find corresponding element in original dataframe
        correspondingStatus = test_df2.loc[(test_df2["externalId"] == data["externalId"]) & (test_df2["timestamp"] != data["timestamp"])]
        # Calculate time difference
        time = correspondingStatus["timestamp"] - data["timestamp"]
        # some prints:
        print(type(time))
        # <class 'pandas.core.series.Series'> --> Why is this a series?
        print(time.array[0])
        # 0 days 00:08:23.837000 --> This looks correct, I think?
        print(time)
        # 7   0 days 00:08:23.837000
        # Name: timestamp, dtype: timedelta64[ns]
        # Check whether an entry for this day already exists
        elem = next((x for x in grouped_bins if (x["startDate"] == name) and ("productiveTime" in x)), None)
        # If it does not exist yet, add a new entry
        if elem is None:
            grouped_bins.append({"startDate": name, "productiveTime": time})
        else:
            elem["productiveTime"] = elem["productiveTime"]
            # See below for problem

# Loop over groups
for name, group in grouped_jobstates:
    group.apply(jobStatusPRAggregator, args=(name,), axis=1)
The problem I face is the following. The entry (`elem`) looks like this in the end:
{'startDate': Timestamp('2020-05-11 00:00:00', freq='B'), 'productiveTime': 0 NaT
7 NaT
8 NaT
16 NaT
17 NaT
..
1090 NaT
1091 NaT
1099 NaT
1100 NaT
1107 NaT
Name: timestamp, Length: 254, dtype: timedelta64[ns]}
What I want is something like this:
{'startDate': Timestamp('2020-05-11 00:00:00', freq='B'), 'productiveTime': 2 Days 12 hours 39 minutes 29 seconds
Name: timestamp, Length: 254, dtype: timedelta64[ns]}
Though I am open to suggestions on how to store time durations in Python/Pandas.
I am also open to suggestions regarding the loop itself.
Upvotes: 0
Views: 1262
Reputation: 147
To convert your timestamp strings to a datetime object:
from datetime import datetime

datetime_format = "%Y-%m-%d %H:%M:%S.%f"  # .%f handles the millisecond part
datetime_obj = datetime.strptime(datetime_string, datetime_format)
Then from your code above:
# Calculate time difference
time1 = datetime.strptime(correspondingStatus["timestamp"], datetime_format)
time2 = datetime.strptime(data["timestamp"], datetime_format)
time = time1 - time2
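For example, with the two sample rows from your table and the format above, this gives the expected difference:

from datetime import datetime

fmt = "%Y-%m-%d %H:%M:%S.%f"
t1 = datetime.strptime("2020-05-11 13:14:29.759", fmt)  # row 7
t2 = datetime.strptime("2020-05-11 13:06:05.922", fmt)  # row 0
print(t1 - t2)  # 0:08:23.837000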
Upvotes: 1
Reputation: 26201
To obtain timestamp differences between consecutive rows with the same `externalId`, you should be able to simply write, for example:
df2 = df.assign(delta=df.groupby('externalId')['timestamp'].diff())
On the example you give:
>>> df2
timestamp status externalId delta
0 2020-05-11 13:06:05.922 1 1 NaT
7 2020-05-11 13:14:29.759 10 1 0 days 00:08:23.837000
8 2020-05-11 13:16:09.147 1 2 NaT
16 2020-05-11 13:19:08.641 10 2 0 days 00:02:59.494000
If your timestamps are not actually of type `Timestamp` yet, then you can convert them first:
df['timestamp'] = pd.to_datetime(df['timestamp'])
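If you then want one total per business day (like your `grouped_bins`), you could, for example, sum the deltas per day, reusing the same `freq='B'` grouper from your question, with something like:

import pandas as pd

# Sum the per-row deltas within each business day; NaT rows are skipped by sum()
productive_per_day = (
    df2.groupby(pd.Grouper(key="timestamp", freq="B", origin="start"))["delta"].sum()
)
print(productive_per_day)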
Upvotes: 2