Outcast
Outcast

Reputation: 5117

Uniqify list of dicts based on specific keys - Keep specific occurrences in cases of duplicates

Let's suppose that I have a list of dicts like this:

list = [{'key':1,'timestamp':1234567890,'action':'like','type':'photo','id':245},
        {'key':2,'timestamp':2345678901,'action':'like','type':'photo','id':252},
        {'key':1,'timestamp':3456789012,'action':'like','type':'photo','id':212}]

I want to uniqify the list of dicts based on key and timestamp.

Specifically, I want to keep dicts with unique key and keep the most recent timestamp when there are duplicate keys based on key.

Therefore, I want to have the following:

list = [{'key':1,'timestamp':3456789012,'action':'like','type':'photo','id':212}`
        {'key':2,'timestamp':2345678901,'action':'like','type':'photo','id':252}]

How can I efficiently do this?

Upvotes: 1

Views: 55

Answers (5)

kederrac
kederrac

Reputation: 17322

my_list = [{'key':1,'timestamp':1234567890,'action':'like','type':'photo','id':245},
        {'key':2,'timestamp':2345678901,'action':'like','type':'photo','id':252},
        {'key':1,'timestamp':3456789012,'action':'like','type':'photo','id':212}]

r = {}

for d in my_list:
    k = d['key']
    if k not in r or r[k]['timestamp'] < d['timestamp']:
        r[k] = d

list(r.values())

output:

[{'key': 1,
  'timestamp': 3456789012,
  'action': 'like',
  'type': 'photo',
  'id': 212},
 {'key': 2,
  'timestamp': 2345678901,
  'action': 'like',
  'type': 'photo',
  'id': 252}]

here is a simple benchmark between most of the proposed solutions:

enter image description here

from itertools import groupby
import itertools
from operator import itemgetter

from simple_benchmark import BenchmarkBuilder

b = BenchmarkBuilder()

@b.add_function()
def kederrac(lst):
    r = {}
    for d in lst:
        k = d['key']
        if k not in r or r[k]['timestamp'] < d['timestamp']:
            r[k] = d

    return list(r.values())

@b.add_function()
def Daweo(lst):
    s = sorted(lst, key=lambda x:(x['key'],x['timestamp']), reverse=True)
    return [next(g) for k, g in itertools.groupby(s, lambda x:x['key'])]

@b.add_function()
def Jan(lst):
    result = []
    sorted_lst = sorted(lst, key=lambda x: x['key'])
    for k,v in groupby(sorted_lst, key = lambda x: x['key']):
        result.append(max(v, key=lambda x: x['timestamp']))
    return result

@b.add_function()
def Jan_one_line(lst):
    keyfunc = lambda x: x['key']
    return [max(v, key = lambda x: x['timestamp'])
            for k, v in groupby(sorted(lst, key=keyfunc), key=keyfunc)]

@b.add_function()
def gold_cy(lst):
    key = itemgetter('key')
    ts = itemgetter('timestamp')

    def custom_sort(item): 
        return (key(item), -ts(item))

    results = []
    for k, v in groupby(sorted(lst, key=custom_sort), key=key):
        results.append(next(v))

    return results

@b.add_arguments('Number of dictionaries in list')
def argument_provider():
    for exp in range(2, 18):
        size = 2**exp

        yield size, [{'key':choice(range((size // 10) or 2)),
                      'timestamp': randint(1_000_000_000, 10_000_000_000),
                      'action':'like','type':'photo','id':randint(100, 10000)}
                     for _ in range(size)]

r = b.run()
r.plot()

it shows that a simple for loop solution is more efficient, the result is expected since the sorted built-in function will come with a O(NlogN) time complexity

Upvotes: 2

Jan
Jan

Reputation: 43199

Another solution with itertools.groupby:

from itertools import groupby

lst = [{'key':1,'timestamp':1234567890,'action':'like','type':'photo','id':245},
       {'key':2,'timestamp':2345678901,'action':'like','type':'photo','id':252},
       {'key':1,'timestamp':3456789012,'action':'like','type':'photo','id':212}]

result = []
sorted_lst = sorted(lst, key=lambda x: x['key'])
for k,v in groupby(sorted_lst, key = lambda x: x['key']):
    result.append(max(v, key=lambda x: x['timestamp']))

print(result)

Or - if you are into one-liners:

keyfunc = lambda x: x['key']
result = [max(v, key = lambda x: x['timestamp'])
          for k, v in groupby(sorted(lst, key=keyfunc), key=keyfunc)]


Additionally, do not name your variables like builtin-functions, e.g. list or id. id(...) returns the identity of an object (random, but unique within the same program).

Upvotes: 1

gold_cy
gold_cy

Reputation: 14236

We can use a combination of itertools.groupby and itemgetter. One caveat is that data must be presorted for itertools.groupby to work properly.

from itertools import groupby
from operator import itemgetter

key = itemgetter('key')
ts = itemgetter('timestamp')

def custom_sort(item): 
    return (key(item), -ts(item))

results = []
for k, v in groupby(sorted(data, key=custom_sort), key=key):
    results.append(next(v))

[{'id': 212,
  'action': 'like',
  'key': 1,
  'timestamp': 3456789012,
  'type': 'photo'},
 {'id': 252,
  'action': 'like',
  'key': 2,
  'timestamp': 2345678901,
  'type': 'photo'}]

As a side note, don't name variable using built-in names like list or id.

Upvotes: 0

Daweo
Daweo

Reputation: 36808

You could do that using itertools.groupby following way:

import itertools
lst = [{'key':1,'timestamp':1234567890,'action':'like','type':'photo','id':245},{'key':2,'timestamp':2345678901,'action':'like','type':'photo','id':252},{'key':1,'timestamp':3456789012,'action':'like','type':'photo','id':212}]
s = sorted(lst, key=lambda x:(x['key'],x['timestamp']), reverse=True)
uniq_lst = [next(g) for k, g in itertools.groupby(s, lambda x:x['key'])]

Output:

[{'key': 2, 'timestamp': 2345678901, 'action': 'like', 'type': 'photo', 'id': 252}, {'key': 1, 'timestamp': 3456789012, 'action': 'like', 'type': 'photo', 'id': 212}]

Firstly I sort by key, timestamp so elements with same key will be adjacent and also reverse so highest timestamp will be first. Then I group elements with same key and get first record from every group.

Upvotes: 0

Ashwani
Ashwani

Reputation: 2052

Easiest way would be to insert it into a dict and then read back all the values as list. Also you should not use list as name of a variable.

d = {} 
for item in lst: 
    key = item['key'] 
    if key not in d or item['timestamp'] > d[key]['timestamp']: 
        d[key] = item 
list(s.values()) 

Upvotes: 1

Related Questions