mtndoe

Reputation: 444

create unique list: comparing dict objects

I have a list of objects, each with an id, a date, and an indication of the type of object. For example:

original_list = [{'id':1,'date':'2016-01-01','type':'A'},
                 {'id':2,'date':'2016-02-01','type':'B'},
                 {'id':3,'date':'2016-03-01','type':'A'},
                 {'id':1,'date':'2016-04-01','type':'C'}]

As shown above, this list can contain duplicate ids with different dates and types. I want to create a list with unique ids that contains only the latest entry (based on date) for each id. My current procedure is as follows:

# Create list of unique ids
unique_ids = list(set([foo.get('id') for foo in original_list]))

# Find the latest entry for each id
for unique_id in unique_ids:
    foo_same_id = [foo for foo in original_list if foo.get('id') == unique_id]
    if len(foo_same_id) == 1:
        # only one entry for this id: use this one
        latest_object = foo_same_id
    else:
        latest_date = [foo.get('date') for foo in foo_same_id]
        latest_date = max(latest_date)
        latest_object = [foo for foo in foo_same_id if foo.get('date') == latest_date]

After this, the list of entries with the same id is sorted by date, and the type of the last entry is used to fill in the type of the object. At that point I no longer need these objects, so I make a copy of the two lists (original_list and unique_ids) without the processed objects/ids.

This seems to work, but when applied to more than 200,000 entries it takes a lot of time (more than 4 hours). Are there ways to speed this up, or different implementations? Currently I read the data from a database and start processing immediately.

Upvotes: 2

Views: 80

Answers (2)

salparadise

Reputation: 5805

Deduplicate the original list with a custom function that walks the list only once and flattens the result at the end:

def dedup_original(original):
    items = {}
    for item in original:
        if item['id'] in items:
            if items[item['id']]['date'] < item['date']:
                items[item['id']] = item
        else:
            items[item['id']] = item
    return list(items.values())

Result:

In [28]: dedup_original(original_list)
Out[28]:
[{'date': '2016-04-01', 'id': 1, 'type': 'C'},
 {'date': '2016-02-01', 'id': 2, 'type': 'B'},
 {'date': '2016-03-01', 'id': 3, 'type': 'A'}]
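
A more compact variant of the same idea, as a minimal sketch rather than a drop-in replacement: sort by date first, then let a dict comprehension keep the last item seen for each id. It assumes every item has 'id' and 'date' keys and that the dates are zero-padded ISO strings, so sorting them as text matches chronological order. The name latest_by_id is hypothetical.

```python
from operator import itemgetter


def latest_by_id(original):
    """Return one item per id, keeping the item with the greatest date."""
    # Sort ascending by date so that, for each id, the newest item is assigned last
    by_date = sorted(original, key=itemgetter('date'))
    # Later assignments overwrite earlier ones for the same id
    return list({item['id']: item for item in by_date}.values())


original_list = [{'id': 1, 'date': '2016-01-01', 'type': 'A'},
                 {'id': 2, 'date': '2016-02-01', 'type': 'B'},
                 {'id': 3, 'date': '2016-03-01', 'type': 'A'},
                 {'id': 1, 'date': '2016-04-01', 'type': 'C'}]

result = latest_by_id(original_list)
```

The sort makes this O(n log n) instead of a single O(n) pass, so the loop above is likely the better choice for very large lists. The two also differ on ties: with equal dates the sorted version keeps the item that appears later in the input, while the loop keeps the first one it saw.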

Upvotes: 0

Kasravnd

Reputation: 107297

Instead of building the set of unique ids and then looping over the list again for each of them, you can use a custom dictionary that stores your dictionaries keyed by their ids. Because a dictionary keeps only one value per key, overriding the __setitem__ method so that it replaces a stored value only when the new entry's date is greater than the current one gives you the desired list directly.

from datetime import datetime


class UniqueDict(dict):
    def __init__(self, *args, **kwds):
        super(UniqueDict, self).__init__(*args, **kwds)

    def __setitem__(self, _id, value):
        # Keep the stored entry unless the new one has a later date
        current = self.get(_id)
        if current is not None:
            date_obj = datetime.strptime(value['date'], '%Y-%m-%d')
            current_date_obj = datetime.strptime(current['date'], '%Y-%m-%d')
            if date_obj > current_date_obj:
                dict.__setitem__(self, _id, value)
        else:
            dict.__setitem__(self, _id, value)

Demo:

original_list = [{'id':1,'date':'2016-01-01','type':'A'},
                 {'id':2,'date':'2016-02-01','type':'B'},
                 {'id':3,'date':'2016-03-01','type':'A'},
                 {'id':1,'date':'2016-04-01','type':'C'}]


udict = UniqueDict()

for d in original_list:
    udict[d['id']] = d

print(udict)

Output:

{1: {'id': 1, 'date': '2016-04-01', 'type': 'C'},
 2: {'id': 2, 'date': '2016-02-01', 'type': 'B'},
 3: {'id': 3, 'date': '2016-03-01', 'type': 'A'}}

Note that, as mentioned in the comments, you can also skip converting the date strings to datetime objects here, since ISO-formatted dates can be compared lexicographically.
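
To illustrate that note, here is a rough sketch of the same dictionary without datetime. It assumes every date is a zero-padded 'YYYY-MM-DD' string; the class name LatestByDate and the sample entries are hypothetical, chosen so that the newest entry for id 1 arrives first.

```python
class LatestByDate(dict):
    """Hypothetical variant of UniqueDict that compares date strings directly."""

    def __setitem__(self, _id, value):
        current = self.get(_id)
        # ISO 'YYYY-MM-DD' strings order the same way as the dates they represent
        if current is None or value['date'] > current['date']:
            dict.__setitem__(self, _id, value)


entries = [{'id': 1, 'date': '2016-04-01', 'type': 'C'},
           {'id': 1, 'date': '2016-01-01', 'type': 'A'},
           {'id': 2, 'date': '2016-02-01', 'type': 'B'}]

latest = LatestByDate()
for entry in entries:
    latest[entry['id']] = entry

print(latest[1]['type'])  # prints C
```

Keep in mind that only item assignment (latest[key] = value) goes through the overridden method. Passing data to the constructor or calling update() on a dict subclass does not call __setitem__, so entries added that way would skip the date check.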

Upvotes: 1
