pete lee

Reputation: 95

Only keep the last timestamp for each day in Python

Hey, sorry for being a rookie with Python; I just need a point in the right direction.

Here is the data I am working with:

 2018-12-14 14:36:00
 2018-12-15 02:00:00
 2018-12-15 04:48:00
 2018-12-16 06:12:00
 2018-12-16 11:28:00
 2018-12-16 23:52:00

What I need to do is keep only the last entry for each day. The values are stored in a numpy array: is it wise to take them and append them to a list? Would that be my best option? I ask because I know I cannot delete values from a numpy array. I'm just curious whether this is easier to do with a list or with a numpy array.

Upvotes: 4

Views: 557

Answers (3)

RoadRunner

Reputation: 26315

If your data is not sorted, you could also store your dates in collections.defaultdict(), then take the maximum datetime:

from collections import defaultdict
from datetime import datetime

data = [
    '2018-12-14 14:36:00',
    '2018-12-15 02:00:00',
    '2018-12-15 04:48:00',
    '2018-12-16 06:12:00',
    '2018-12-16 11:28:00',
    '2018-12-16 23:52:00'
]

d = defaultdict(list)
for item in data:
    date, _ = item.split()
    d[date].append(item)

print([max(v, key=lambda x: datetime.strptime(x, "%Y-%m-%d %H:%M:%S")) for _, v in d.items()])
# ['2018-12-14 14:36:00', '2018-12-15 04:48:00', '2018-12-16 23:52:00']

The above uses datetime.datetime.strptime() to get the maximum datetime, as specified in the key parameter.

If your data is already sorted, you can take advantage of itertools.groupby():

from itertools import groupby

data = [
    '2018-12-14 14:36:00',
    '2018-12-15 02:00:00',
    '2018-12-15 04:48:00',
    '2018-12-16 06:12:00',
    '2018-12-16 11:28:00',
    '2018-12-16 23:52:00'
]

print([list(g)[-1] for _, g in groupby(data, key=lambda x: x.split()[0])])
# ['2018-12-14 14:36:00', '2018-12-15 04:48:00', '2018-12-16 23:52:00']

Upvotes: 0

wim

Reputation: 362717

I noticed that your times are ordered. If that is the case, you may simply convert to dict to keep the last entry:

>>> times
['2018-12-14 14:36:00',
 '2018-12-15 02:00:00',
 '2018-12-15 04:48:00',
 '2018-12-16 06:12:00',
 '2018-12-16 11:28:00',
 '2018-12-16 23:52:00']
>>> print(*dict(s.split() for s in times).items(), sep='\n')
('2018-12-14', '14:36:00')
('2018-12-15', '04:48:00')
('2018-12-16', '23:52:00')

If the ordering is not reliable, do not sort: sorting degrades the time complexity to O(n log n) for what is an O(n) task. Instead, build a dict whose keys are dates and whose values are lists of times, then take the maximum of each list.
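A minimal sketch of that single-pass idea, under a couple of assumptions: the function name last_per_day and the shuffled sample list are made up for illustration, and instead of storing a list of times per date it keeps only the running maximum, which is a small variation on what is described above. It also assumes the times are zero-padded HH:MM:SS strings, so that plain string comparison matches chronological order.

```python
def last_per_day(stamps):
    """Return {date: latest time} in one pass, without sorting the input."""
    latest = {}
    for stamp in stamps:
        day, time = stamp.split()
        # Zero-padded HH:MM:SS strings compare in chronological order,
        # so a plain string comparison is enough here (assumption).
        if day not in latest or time > latest[day]:
            latest[day] = time
    return latest


# Hypothetical, deliberately unsorted version of the question's data.
times = [
    '2018-12-16 11:28:00',
    '2018-12-14 14:36:00',
    '2018-12-15 04:48:00',
    '2018-12-16 23:52:00',
    '2018-12-15 02:00:00',
    '2018-12-16 06:12:00',
]

result = last_per_day(times)
for day in sorted(result):
    print(day, result[day])
```

The final sorted() call only orders the handful of distinct days for display; the pass over the timestamps themselves stays O(n).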

Upvotes: 4

ApplePie

Reputation: 8942

If pandas is an option, you can use the grouping facilities of its DataFrames.

The following makes two assumptions:

  1. Your data have been sorted by time stamps;
  2. You have added an extra column on which to group, holding only the date part of each time stamp, so that you can group by day.

    my_dataframe.groupby(['my_date_only_column']).my_timestamp_column.last()
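To make the two assumptions concrete, here is a rough end-to-end sketch using the data from the question. The column names ts and day are placeholders chosen for this example (they stand in for my_timestamp_column and my_date_only_column above), and the explicit sort is there only to satisfy the first assumption.

```python
import pandas as pd

# Sample data from the question, parsed into real timestamps.
df = pd.DataFrame({
    "ts": pd.to_datetime([
        '2018-12-14 14:36:00',
        '2018-12-15 02:00:00',
        '2018-12-15 04:48:00',
        '2018-12-16 06:12:00',
        '2018-12-16 11:28:00',
        '2018-12-16 23:52:00',
    ])
})

# Assumption 1: rows are sorted by time stamp.
df = df.sort_values("ts")

# Assumption 2: an extra column holding only the date, to group on.
df["day"] = df["ts"].dt.date

# Last time stamp within each day.
last_per_day = df.groupby("day")["ts"].last()

# Format back to strings for easy comparison with the input.
last_as_text = last_per_day.dt.strftime("%Y-%m-%d %H:%M:%S").tolist()
print(last_as_text)
```

If the rows cannot be assumed sorted, replacing .last() with .max() on the same grouping should give the same result without relying on row order.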
    

Upvotes: 0
