frazman
frazman

Reputation: 33243

Remove duplicates while merging lists of lists into single list in python

So, I have lists of lists like following:

data = [
['foo', 'bar'],
['one', 'two']
]

And, I want to flatten these lists by alternating between two lists. So, output like

flattened = ['foo', 'one', 'bar', 'two']

I am using the list(chain.from_iterable(zip_longest(*data))) which works fine.

But, I am trying to figure out how to handle scenarios where there are duplicates that I want to get rid of.

data = [
['foo', 'bar'],
['foo', 'two']
]

I want something like

flatted = ['foo', 'two', 'bar'] 

rather than ['foo', 'foo', 'bar', 'two']

How do I do this?

Upvotes: 1

Views: 232

Answers (4)

Massifox
Massifox

Reputation: 4487

Try this code:

list(dict.fromkeys(sum(data, [])))

EDIT: As pointed out in the comments, the sum is not the most efficient method to flatten a list, you can use itertools.chain.from_iterable to get the flattened list, then do the following:

list(dict.fromkeys(chain.from_iterable(data)))

In both cases the output is the following:

['foo', 'bar', 'two']

Comparison of execution times

Below they propose a comparison of the execution times of the main proposed solutions:

  1. Benchmark data1:

     data1 = [['foo', 'bar'],['foo', 'two']] * 1000000
    
    1. @Massifox's solution with itertools.chain.from_iterable

      %%timeit
      list(chain.from_iterable(data1))
      # 128 ms ± 11.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) 
      
    2. @Marat's solution with dedup

      %%timeit
      list(dedup(chain.from_iterable(zip_longest(*data1))))
      # 579 ms ± 116 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) 
      
    3. @Alexander's solution with zip_longest

      %%timeit
      list(dict.fromkeys(chain.from_iterable(zip_longest(*data1))))
      # 456 ms ± 149 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) 
      
  2. Benchmark data2:

    x = 10
    y = 500000
    n_max = 1000
    data2 = [[np.random.randint(1, n_max) for _ in range(0, x)] for _ in range(0, y)]
    
    1. @Massifox's solution with itertools.chain.from_iterable

      %%timeit
      list(chain.from_iterable(data2))
      # 241 ms ± 20 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
      
    2. @Marat's solution with dedup

      %%timeit
      list(dedup(chain.from_iterable(zip_longest(*data2))))
      # 706 ms ± 18.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
      
    3. @Alexander's solution with zip_longest

      %%timeit
      list(dict.fromkeys(chain.from_iterable(zip_longest(*data2))))
      # 674 ms ± 56.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
      

The implementations based on sum and on Counter are decisively less efficient and take tens of seconds already with instances of smaller benchmark of [['foo', 'bar'],['foo', 'two']] * 100k.

With benchmark data1, the solution based on itertools.chain.from_iterable proposed by me seems to be about 4-5 times faster than the others.

With benchmark data2, the solution based on itertools.chain.from_iterable proposed by me seems to be about 2-3 times faster than the others.

Upvotes: 0

Alexander
Alexander

Reputation: 109546

Use a set to keep track of what you've already seen, which is an O(1) membership test.

result = []
seen = set()
for item in chain.from_iterable(zip_longest(*data)):
    if item not in seen:
        seen.add(item)
        result.append(item)
>>> result
['foo', 'bar', 'two']

Note that this question talks about removing duplicates from a list: Removing duplicates in lists

TL;DR

For Python 3.7+ (or Cython 3.6+):

>>> list(dict.fromkeys(chain.from_iterable(zip_longest(*data))))
['foo', 'bar', 'two']

Upvotes: 4

Brian
Brian

Reputation: 1604

Hmm, this might be a bit more overhead than you're looking for but this should work and guarantees list-wise order unlike sets:

from itertools import cycle
from collections import Counter

output = []
checker = Counter()

for lst in cycle(data):
    if not data:
        break

    while lst:
        item = lst.pop(0)
        if not checker[item]:
            output.append(item)
            checker[item] += 1
            break            

    if not lst:
        data.remove(lst)
        continue  

output:

['foo', 'two', 'bar']

Upvotes: 0

juancarlos
juancarlos

Reputation: 631

you can create a set and then convert to list again some like this:

l1 = ['foo', 'foo', 'bar', 'two']
l2 = list(set(l1))

it's create a second list without repeated items

if you want keept the order you can do this

ordered = dict.fromkeys(l1).keys()

Upvotes: -3

Related Questions