Reputation: 33243
So, I have lists of lists like the following:
data = [
['foo', 'bar'],
['one', 'two']
]
And I want to flatten these lists by alternating between the two lists, so the output looks like:
flattened = ['foo', 'one', 'bar', 'two']
I am using list(chain.from_iterable(zip_longest(*data))), which works fine.
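For reference, here is the complete runnable snippet with its imports (a minimal sketch of the interleaving approach above):
from itertools import chain, zip_longest

data = [
    ['foo', 'bar'],
    ['one', 'two'],
]

# zip_longest(*data) yields the columns ('foo', 'one') and ('bar', 'two');
# chain.from_iterable then flattens those tuples into one sequence.
# For lists of unequal length, zip_longest pads with None.
flattened = list(chain.from_iterable(zip_longest(*data)))
print(flattened)  # ['foo', 'one', 'bar', 'two']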
But I am trying to figure out how to handle scenarios where there are duplicates that I want to get rid of. For example:
data = [
['foo', 'bar'],
['foo', 'two']
]
I want something like
flattened = ['foo', 'two', 'bar']
rather than ['foo', 'foo', 'bar', 'two']
How do I do this?
Upvotes: 1
Views: 232
Reputation: 4487
Try this code:
list(dict.fromkeys(sum(data, [])))
EDIT:
As pointed out in the comments, sum is not the most efficient way to flatten a list; you can use itertools.chain.from_iterable to get the flattened list instead, then do the following:
from itertools import chain

list(dict.fromkeys(chain.from_iterable(data)))
In both cases the output is the following:
['foo', 'bar', 'two']
Below is a comparison of the execution times of the main proposed solutions.
Benchmark data1:
data1 = [['foo', 'bar'],['foo', 'two']] * 1000000
itertools.chain.from_iterable
%%timeit
list(chain.from_iterable(data1))
# 128 ms ± 11.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
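Note: the dedup helper timed below is not defined in this answer; a plausible reconstruction, assuming it is a generator form of the seen-set approach from the answer further down:
def dedup(iterable):
    # Yield each item the first time it appears, preserving order.
    seen = set()
    for item in iterable:
        if item not in seen:
            seen.add(item)
            yield item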
dedup
%%timeit
list(dedup(chain.from_iterable(zip_longest(*data1))))
# 579 ms ± 116 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
zip_longest
%%timeit
list(dict.fromkeys(chain.from_iterable(zip_longest(*data1))))
# 456 ms ± 149 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Benchmark data2:
import numpy as np

x = 10
y = 500000
n_max = 1000
data2 = [[np.random.randint(1, n_max) for _ in range(0, x)] for _ in range(0, y)]
itertools.chain.from_iterable
%%timeit
list(chain.from_iterable(data2))
# 241 ms ± 20 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
dedup
%%timeit
list(dedup(chain.from_iterable(zip_longest(*data2))))
# 706 ms ± 18.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
zip_longest
%%timeit
list(dict.fromkeys(chain.from_iterable(zip_longest(*data2))))
# 674 ms ± 56.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The implementations based on sum and on Counter are markedly less efficient, already taking tens of seconds on the smaller benchmark [['foo', 'bar'], ['foo', 'two']] * 100k (sum(data, []) copies the accumulated list on every concatenation, so it is quadratic in the total number of elements).
With benchmark data1, the solution I proposed based on itertools.chain.from_iterable seems to be about 4-5 times faster than the others; with benchmark data2, about 2-3 times faster.
Upvotes: 0
Reputation: 109546
Use a set to keep track of what you've already seen, which is an O(1) membership test.
from itertools import chain, zip_longest

result = []
seen = set()  # items already emitted
for item in chain.from_iterable(zip_longest(*data)):
    if item not in seen:
        seen.add(item)
        result.append(item)
>>> result
['foo', 'bar', 'two']
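The same logic can be packaged as a reusable generator; this is essentially the unique_everseen recipe from the itertools documentation (a minimal sketch, without the recipe's optional key argument):
def unique_everseen(iterable):
    # Yield elements in order, skipping any seen before.
    seen = set()
    for element in iterable:
        if element not in seen:
            seen.add(element)
            yield element

>>> list(unique_everseen(chain.from_iterable(zip_longest(*data))))
['foo', 'bar', 'two']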
Note that this related question covers removing duplicates from a list: Removing duplicates in lists
TL;DR
For Python 3.7+ (or CPython 3.6+):
>>> list(dict.fromkeys(chain.from_iterable(zip_longest(*data))))
['foo', 'bar', 'two']
Upvotes: 4
Reputation: 1604
Hmm, this might be a bit more overhead than you're looking for, but it works and guarantees list-wise order, unlike sets. Note that it consumes data in place (the inner lists are emptied as it runs):
from itertools import cycle
from collections import Counter

output = []
checker = Counter()  # counts items already seen
for lst in cycle(data):
    if not data:
        break  # every sublist has been exhausted
    while lst:
        item = lst.pop(0)
        if not checker[item]:
            # First occurrence: emit it and move on to the next list.
            output.append(item)
            checker[item] += 1
            break
    if not lst:
        # This sublist is exhausted; drop it so the loop can terminate.
        data.remove(lst)
        continue
Output:
['foo', 'two', 'bar']
Upvotes: 0
Reputation: 631
You can create a set and then convert it back to a list, like this:
l1 = ['foo', 'foo', 'bar', 'two']
l2 = list(set(l1))
This creates a second list without the repeated items, but it does not preserve order. If you want to keep the order, you can do this instead:
ordered = dict.fromkeys(l1).keys()
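For example, converting the keys view back to a list (using the l1 above):
>>> list(dict.fromkeys(l1))
['foo', 'bar', 'two']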
Upvotes: -3