Reputation: 5117
Let's suppose that I have a list of dicts like this:
list = [{'key':1,'timestamp':1234567890,'action':'like','type':'photo','id':245},
{'key':2,'timestamp':2345678901,'action':'like','type':'photo','id':252},
{'key':1,'timestamp':3456789012,'action':'like','type':'photo','id':212}]
I want to uniqify the list of dicts based on key
and timestamp
.
Specifically, I want to keep dicts with unique key
and keep the most recent timestamp
when there are duplicate keys
based on key
.
Therefore, I want to have the following:
list = [{'key':1,'timestamp':3456789012,'action':'like','type':'photo','id':212}`
{'key':2,'timestamp':2345678901,'action':'like','type':'photo','id':252}]
How can I efficiently do this?
Upvotes: 1
Views: 55
Reputation: 17322
my_list = [{'key':1,'timestamp':1234567890,'action':'like','type':'photo','id':245},
{'key':2,'timestamp':2345678901,'action':'like','type':'photo','id':252},
{'key':1,'timestamp':3456789012,'action':'like','type':'photo','id':212}]
r = {}
for d in my_list:
k = d['key']
if k not in r or r[k]['timestamp'] < d['timestamp']:
r[k] = d
list(r.values())
output:
[{'key': 1,
'timestamp': 3456789012,
'action': 'like',
'type': 'photo',
'id': 212},
{'key': 2,
'timestamp': 2345678901,
'action': 'like',
'type': 'photo',
'id': 252}]
here is a simple benchmark between most of the proposed solutions:
from itertools import groupby
import itertools
from operator import itemgetter
from simple_benchmark import BenchmarkBuilder
b = BenchmarkBuilder()
@b.add_function()
def kederrac(lst):
r = {}
for d in lst:
k = d['key']
if k not in r or r[k]['timestamp'] < d['timestamp']:
r[k] = d
return list(r.values())
@b.add_function()
def Daweo(lst):
s = sorted(lst, key=lambda x:(x['key'],x['timestamp']), reverse=True)
return [next(g) for k, g in itertools.groupby(s, lambda x:x['key'])]
@b.add_function()
def Jan(lst):
result = []
sorted_lst = sorted(lst, key=lambda x: x['key'])
for k,v in groupby(sorted_lst, key = lambda x: x['key']):
result.append(max(v, key=lambda x: x['timestamp']))
return result
@b.add_function()
def Jan_one_line(lst):
keyfunc = lambda x: x['key']
return [max(v, key = lambda x: x['timestamp'])
for k, v in groupby(sorted(lst, key=keyfunc), key=keyfunc)]
@b.add_function()
def gold_cy(lst):
key = itemgetter('key')
ts = itemgetter('timestamp')
def custom_sort(item):
return (key(item), -ts(item))
results = []
for k, v in groupby(sorted(lst, key=custom_sort), key=key):
results.append(next(v))
return results
@b.add_arguments('Number of dictionaries in list')
def argument_provider():
for exp in range(2, 18):
size = 2**exp
yield size, [{'key':choice(range((size // 10) or 2)),
'timestamp': randint(1_000_000_000, 10_000_000_000),
'action':'like','type':'photo','id':randint(100, 10000)}
for _ in range(size)]
r = b.run()
r.plot()
it shows that a simple for
loop solution is more efficient, the result is expected since the sorted
built-in function will come with a O(NlogN) time complexity
Upvotes: 2
Reputation: 43199
Another solution with itertools.groupby
:
from itertools import groupby
lst = [{'key':1,'timestamp':1234567890,'action':'like','type':'photo','id':245},
{'key':2,'timestamp':2345678901,'action':'like','type':'photo','id':252},
{'key':1,'timestamp':3456789012,'action':'like','type':'photo','id':212}]
result = []
sorted_lst = sorted(lst, key=lambda x: x['key'])
for k,v in groupby(sorted_lst, key = lambda x: x['key']):
result.append(max(v, key=lambda x: x['timestamp']))
print(result)
Or - if you are into one-liners:
keyfunc = lambda x: x['key']
result = [max(v, key = lambda x: x['timestamp'])
for k, v in groupby(sorted(lst, key=keyfunc), key=keyfunc)]
list
or id
. id(...)
returns the identity of an object (random, but unique within the same program).
Upvotes: 1
Reputation: 14236
We can use a combination of itertools.groupby
and itemgetter
. One caveat is that data must be presorted for itertools.groupby
to work properly.
from itertools import groupby
from operator import itemgetter
key = itemgetter('key')
ts = itemgetter('timestamp')
def custom_sort(item):
return (key(item), -ts(item))
results = []
for k, v in groupby(sorted(data, key=custom_sort), key=key):
results.append(next(v))
[{'id': 212,
'action': 'like',
'key': 1,
'timestamp': 3456789012,
'type': 'photo'},
{'id': 252,
'action': 'like',
'key': 2,
'timestamp': 2345678901,
'type': 'photo'}]
As a side note, don't name variable using built-in names like list
or id
.
Upvotes: 0
Reputation: 36808
You could do that using itertools.groupby following way:
import itertools
lst = [{'key':1,'timestamp':1234567890,'action':'like','type':'photo','id':245},{'key':2,'timestamp':2345678901,'action':'like','type':'photo','id':252},{'key':1,'timestamp':3456789012,'action':'like','type':'photo','id':212}]
s = sorted(lst, key=lambda x:(x['key'],x['timestamp']), reverse=True)
uniq_lst = [next(g) for k, g in itertools.groupby(s, lambda x:x['key'])]
Output:
[{'key': 2, 'timestamp': 2345678901, 'action': 'like', 'type': 'photo', 'id': 252}, {'key': 1, 'timestamp': 3456789012, 'action': 'like', 'type': 'photo', 'id': 212}]
Firstly I sort by key, timestamp so elements with same key will be adjacent and also reverse so highest timestamp will be first. Then I group elements with same key and get first record from every group.
Upvotes: 0
Reputation: 2052
Easiest way would be to insert it into a dict and then read back all the values as list. Also you should not use list
as name of a variable.
d = {}
for item in lst:
key = item['key']
if key not in d or item['timestamp'] > d[key]['timestamp']:
d[key] = item
list(s.values())
Upvotes: 1