Reputation: 444
I have a list of objects: with an id, a date and an indication of the type of object. for example
original_list = [{'id':1,'date':'2016-01-01','type':'A'},
{'id':2,'date':'2016-02-01','type':'B'},
{'id':3,'date':'2016-03-01','type':'A'},
{'id':1,'date':'2016-04-01','type':'C'}]
As shown above this list can contain duplicate id's and different dates, types. Now I want to create a list of unique id's which contains only the last entries (based on date). Now I have a procedure as followed:
# Create list of unique id's
unique_ids = list(set([foo.get('id') for foo in original_list]))
# find last contact
for unique_id in unique_ids:
foo_same_id = [foo for foo in original_list if foo.get('id') == unique_id]
if len(foo_same_id) == 1:
# use this one
else:
latest_date = [foo.get('date') for foo in foo_same_id]
latest_date = max(latest_date)
latest_object = [foo for foo in foo_same_id if foo.get('date') == latest_date]
After this the list with the same id's is sorted on the date and is the last value of type used to fill in the type of the object. At that time I don't need these objects anymore and make a copy of the two lists (original_list and unique_ids) without the processed objects/ids.
This seems to work but when applied to 200.000 + it takes a lot of time (+ 4 hours). Are there ways to speed this up? Different implementations? Currently I'm reading in the data from a database and start processing immediately.
Upvotes: 2
Views: 80
Reputation: 5805
Dedup the original by using a custom function that only walks the list once and flattens it at the end:
def dedup_original(original):
items = {}
for item in original:
if item['id'] in items:
if items[item['id']]['date'] < item['date']:
items[item['id']] = item
else:
items[item['id']] = item
return list(items.values())
Result:
In [28]: dedup_original(original_list)
Out[28]:
[{'date': '2016-04-01', 'id': 1, 'type': 'C'},
{'date': '2016-02-01', 'id': 2, 'type': 'B'},
{'date': '2016-03-01', 'id': 3, 'type': 'A'}]
Upvotes: 0
Reputation: 107297
Instead of creating all unique ids using set and other extra operations, and then looping over the list and using all of those extra
operations, you can simply use a custom dictionary in order to preserve the your dictionaries based on their ids. And due to the fact that dictionaries only keep
the unique items if you override the __setitem__
method in a way that it only replaces the values based on their date (if it's greater than the current one)
you'll simply create your desire list.
from datetime import datetime
class UniqueDict(dict):
def __init__(self, *args, **kwds):
super(UniqueDict, self).__init__(*args, **kwds)
def __setitem__(self, _id, value):
current = self.get(_id)
if current:
date_obj = datetime.strptime(value['date'], '%Y-%m-%d')
current_date_obj = datetime.strptime(self[_id]['date'], '%Y-%m-%d')
if date_obj > current_date_obj:
dict.__setitem__(self, _id, value)
else:
dict.__setitem__(self, _id, value)
Demo:
original_list = [{'id':1,'date':'2016-01-01','type':'A'},
{'id':2,'date':'2016-02-01','type':'B'},
{'id':3,'date':'2016-03-01','type':'A'},
{'id':1,'date':'2016-04-01','type':'C'}]
udict = UniqueDict()
for d in original_list:
udict[d['id']] = d
print(udict)
output:
{1: {'id': 1, 'date': '2016-04-01', 'type': 'C'},
2: {'id': 2, 'date': '2016-02-01', 'type': 'B'},
3: {'id': 3, 'date': '2016-03-01', 'type': 'A'}}
Note that as mentioned in comment, in this case you can also drop using datetime
for converting your date strings to date objects for comparison ,since ISO formatted dates can be compared lexicographically.
Upvotes: 1