Sannin

Reputation: 25

Get only first duplicates in list of dicts with python

I have a list of dicts like this (it could have up to 12000 entries, though):

[
{'date': datetime.datetime(2016, 1, 31, 0, 0), 'title': 'Entry'}, 
{'date': datetime.datetime(2016, 1, 11, 0, 0), 'title': 'Something'},
{'date': datetime.datetime(2016, 1, 1, 0, 0), 'title': 'Entry'}
]

The first entries are the newest. I want to delete duplicates with the same title but keep the oldest entry for each title.

Upvotes: 0

Views: 77

Answers (2)

jDo

Reputation: 4010

I think this does what you want, but I'm using a dictionary rather than a list. It seems better suited to this type of data:

import datetime

dict_list = [
    {'date': datetime.datetime(2016, 1, 31, 0, 0), 'title': 'Entry'},
    {'date': datetime.datetime(2016, 1, 11, 0, 0), 'title': 'Something'},
    {'date': datetime.datetime(2016, 1, 1, 0, 0), 'title': 'Entry'}
]

# Collect the distinct titles.
dict_keys = {x["title"] for x in dict_list}

# For each title, keep the earliest date among the entries that share it.
earliest_entries = {k: min(x["date"] for x in dict_list if x["title"] == k) for k in dict_keys}

Output:

>>> earliest_entries
{'Entry': datetime.datetime(2016, 1, 1, 0, 0), 'Something': datetime.datetime(2016, 1, 11, 0, 0)}
>>> 
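
The comprehension above rescans the whole list once per distinct title, which could get slow near the 12000 entries mentioned in the question. Here is a minimal single-pass sketch that should give the same mapping. It is only an illustration and assumes the dates compare with `<`, as `datetime` objects do:

```python
import datetime

dict_list = [
    {'date': datetime.datetime(2016, 1, 31, 0, 0), 'title': 'Entry'},
    {'date': datetime.datetime(2016, 1, 11, 0, 0), 'title': 'Something'},
    {'date': datetime.datetime(2016, 1, 1, 0, 0), 'title': 'Entry'},
]

earliest_entries = {}
for entry in dict_list:
    title, date = entry["title"], entry["date"]
    # Keep the smaller (earlier) date seen so far for this title.
    if title not in earliest_entries or date < earliest_entries[title]:
        earliest_entries[title] = date
```

Because it compares dates rather than relying on position, this does not depend on the list being sorted newest first.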

Upvotes: 1

Tadhg McDonald-Jensen

Reputation: 21474

If you want to keep the list in its current format, you can keep a set of the titles already seen and walk through the list backwards, either deleting the entry or adding its title to seen:

def r_enumerate(iterable):
    # Use itertools.izip and xrange if you are using Python 2!
    return zip(reversed(range(len(iterable))), 
               reversed(iterable))

seen = set()
for i, subdata in r_enumerate(data):
    if subdata['title'] in seen:
        del data[i]
    else:
        seen.add(subdata['title'])

This won't change the order of the data. Traversing it backwards means that the later (older) entries are the ones kept, and because a deletion only shifts items you have already visited, you don't have to worry about it messing up the rest of the iteration.
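
If you would rather not mutate the list, the same backwards walk can build a new list instead. This is only a sketch: `keep_oldest_per_title` is a name made up for illustration, and it assumes the list is ordered newest first, as in the question.

```python
import datetime

def keep_oldest_per_title(entries):
    """Return a new list with only the last (oldest) entry for each title.

    Assumes entries are ordered newest first.
    """
    seen = set()
    kept = []
    # Walk from the oldest entry to the newest; the first time a title
    # shows up in this direction is its oldest entry.
    for entry in reversed(entries):
        if entry['title'] not in seen:
            seen.add(entry['title'])
            kept.append(entry)
    kept.reverse()  # restore the original newest-first order
    return kept

data = [
    {'date': datetime.datetime(2016, 1, 31, 0, 0), 'title': 'Entry'},
    {'date': datetime.datetime(2016, 1, 11, 0, 0), 'title': 'Something'},
    {'date': datetime.datetime(2016, 1, 1, 0, 0), 'title': 'Entry'},
]
result = keep_oldest_per_title(data)
```

Each `del data[i]` has to shift the items after it, so with many duplicates the in-place loop can do noticeably more work than appending to a new list. The trade-off is that this version holds a second list in memory.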


On the other hand, if you are willing to use a single dictionary to store all the entries instead of a list of little dictionaries, this is really, really easy:

{partdict['title']: partdict['date'] for partdict in LIST_OF_DICTS}

When this is evaluated, entries that come later in the list override earlier ones with the same title, so only the oldest entries are kept (given that the list is ordered newest first). As a bonus, you can then index the entries by their title instead of by their place in the list.

To get back to the list format (containing only the oldest entry for each title) you can do something like:

[{'title': title, 'date': date} for title, date in DICT_FORM.items()]

However, this can change the order of the entries, and it is a lot more work than the loop above if you want to keep the data in list format anyway.
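
To make the order change concrete, here is a rough sketch of the round trip on the sample data from the question. It assumes Python 3.7+, where a dict keeps keys in the order they were first inserted:

```python
import datetime

LIST_OF_DICTS = [
    {'date': datetime.datetime(2016, 1, 31, 0, 0), 'title': 'Entry'},
    {'date': datetime.datetime(2016, 1, 11, 0, 0), 'title': 'Something'},
    {'date': datetime.datetime(2016, 1, 1, 0, 0), 'title': 'Entry'},
]

# Later (older) entries overwrite earlier ones that share a title.
DICT_FORM = {partdict['title']: partdict['date'] for partdict in LIST_OF_DICTS}

# Iterating a dict directly yields only its keys, so .items() is needed here.
rebuilt = [{'title': title, 'date': date} for title, date in DICT_FORM.items()]
```

Here `rebuilt` lists 'Entry' before 'Something', because each title keeps the position of its first appearance, whereas the in-place loop leaves 'Something' first. Any keys other than 'title' and 'date' would also be lost along the way.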

Upvotes: 2
