Reputation: 439
I have a data set containing 4 columns. I would like to get the last entry of the time column per col1 and col3 (for col1 = 1 the last entry is 2018-05-01 23:34:00, and for col1 = 2 the last entry is 2018-05-01 05:48:00), and then see whether the difference between the last entry and the second-to-last entry (for col1 = 1, the second-to-last entry is 2018-04-30 01:58:00) is greater than 24 hours. I want the same check for col1 = 2.
If so, I want to get the values in col2 corresponding to the last and second-to-last entries, and change the value in col2 corresponding to the second-to-last entry of time to null.
Here is part of data:
time col1 col2 col3
2018-04-29 02:12:00 1 50 3
2018-04-30 01:58:00 1 40 3
2018-05-01 23:34:00 1 11 3
2018-04-29 02:02:00 2 10 4
2018-05-01 05:48:00 2 25 4
The output that I would like to have is:
col1 col2 col3
1 [11, Null] 3
2 [25, Null] 4
How can I filter out the value in col2, and make it like the above given the 4 columns that I have initially?
Upvotes: 0
Views: 134
Reputation: 8816
This may help:
>>> df
time col1 col2 col3
0 2018-04-29 02:12:00 1 50 3
1 2018-04-30 01:58:00 1 40 3
2 2018-05-01 23:34:00 1 11 3
3 2018-04-29 02:02:00 2 10 4
4 2018-05-01 05:48:00 2 25 4
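For anyone who wants to follow along, here is a minimal sketch for building this frame from the sample rows in the question. Parsing time with pd.to_datetime and sorting by col1 and time are my assumptions; the later steps rely on time being a real datetime column and on "last" meaning "latest".

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    "time": [
        "2018-04-29 02:12:00",
        "2018-04-30 01:58:00",
        "2018-05-01 23:34:00",
        "2018-04-29 02:02:00",
        "2018-05-01 05:48:00",
    ],
    "col1": [1, 1, 1, 2, 2],
    "col2": [50, 40, 11, 10, 25],
    "col3": [3, 3, 3, 4, 4],
})

# Parse the strings so that time arithmetic works later on.
df["time"] = pd.to_datetime(df["time"])

# Sort so that the last row of each col1 group is also the latest one.
df = df.sort_values(["col1", "time"]).reset_index(drop=True)
```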
We can get the last occurrence of each col1 grouping as below:
>>> df.groupby(['col1'], as_index=False).last()
col1 time col2 col3
0 1 2018-05-01 23:34:00 11 3
1 2 2018-05-01 05:48:00 25 4
Now we can store that in a variable, last_diff, for later comparison.
>>> last_diff = df.groupby(['col1'], as_index=False).last()
Now we can calculate the gap between the last and second-to-last entries of each col1 group with diff() and compare it against pd.Timedelta(hours=24); the comparison returns True or False. This assumes time is already a datetime column and that the rows of each group are in chronological order.
>>> gap = df.groupby('col1')['time'].diff()
>>> last_diff['time_diff'] = (gap.groupby(df['col1']).last() > pd.Timedelta(hours=24)).values
>>> last_diff
col1 time col2 col3 time_diff
0 1 2018-05-01 23:34:00 11 3 True
1 2 2018-05-01 05:48:00 25 4 True
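To get all the way to the output requested in the question, one possible sketch is a plain loop over the (col1, col3) groups. Some assumptions on my side that the question does not settle: None stands in for Null, a group with only one row yields a one-element list, and when the gap is 24 hours or less the second-to-last value is kept unchanged. The name result is just illustrative.

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    "time": pd.to_datetime([
        "2018-04-29 02:12:00",
        "2018-04-30 01:58:00",
        "2018-05-01 23:34:00",
        "2018-04-29 02:02:00",
        "2018-05-01 05:48:00",
    ]),
    "col1": [1, 1, 1, 2, 2],
    "col2": [50, 40, 11, 10, 25],
    "col3": [3, 3, 3, 4, 4],
})

# Chronological order inside each group, so tail(2) is the two latest rows.
df = df.sort_values(["col1", "col3", "time"])

rows = []
for (c1, c3), group in df.groupby(["col1", "col3"]):
    tail = group.tail(2)
    last_val = int(tail["col2"].iloc[-1])
    if len(tail) < 2:
        # Assumption: nothing to compare against, keep only the last value.
        pair = [last_val]
    else:
        gap = tail["time"].iloc[-1] - tail["time"].iloc[-2]
        prev_val = int(tail["col2"].iloc[-2])
        # Null out the second-to-last value only when the gap exceeds 24 hours.
        pair = [last_val, None if gap > pd.Timedelta(hours=24) else prev_val]
    rows.append({"col1": int(c1), "col2": pair, "col3": int(c3)})

result = pd.DataFrame(rows, columns=["col1", "col2", "col3"])
print(result)
```

For the sample data both gaps are longer than 24 hours (about 1 day 21 hours for col1 = 1 and about 2 days 3 hours for col1 = 2), so both second-to-last values come out as None.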
Upvotes: 1