Reputation: 141
I have multiple sorted iterators that yield keyed data, representable by lists:
a = iter([(1, 'a'), (2, 't'), (4, 'c')])
b = iter([(1, 'a'), (3, 'g'), (4, 'g')])
I want to merge them, using the key and keeping track of which iterator had a value for a key. This should be equivalent to a full outer join in SQL:
>>> list(full_outer_join(a, b, key=lambda x: x[0]))
[(1, 'a', 'a'), (2, 't', None), (3, None, 'g'), (4, 'c', 'g')]
I tried using heapq.merge and itertools.groupby, but with merge I already lose the information about which iterator each item came from:
>>> list(heapq.merge(a, b, key=lambda x: x[0]))
[(1, 'a'), (1, 'a'), (2, 't'), (3, 'g'), (4, 'c'), (4, 'g')]
So what I could use is a tag generator
def tagged(it, tag):
    for item in it:
        yield (tag, *item)
and merge the tagged iterators, group by the key and create a dict using the tag:
merged = merge(tagged(a, 'a'), tagged(b, 'b'), key=lambda x: x[1])
grouped = groupby(merged, key=lambda x: x[1])
[(key, {g[0]: g[2] for g in group}) for key, group in grouped]
Which gives me this usable output:
[(1, {'a': 'a', 'b': 'a'}),
(2, {'a': 't'}),
(3, {'b': 'g'}),
(4, {'a': 'c', 'b': 'g'})]
However, I suspect creating a dict for every group is quite costly performance-wise, so maybe there is a more elegant way?
Edit:
To clarify, the dataset is too big to fit into memory, so I definitely need to use generators/iterators.
Edit 2:
To further clarify, a and b should only be iterated over once, because they represent huge files that are slow to read.
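For reference, here are the tagged/merge/groupby pieces above assembled into one runnable sketch (the wrapper name merge_tagged is mine, just for illustration):

```python
import heapq
from itertools import groupby

def tagged(it, tag):
    # Attach the source tag to each (key, value) pair.
    for key, value in it:
        yield (tag, key, value)

def merge_tagged(a, b):
    # heapq.merge is lazy and keeps the stream sorted by key,
    # so each input is iterated exactly once.
    merged = heapq.merge(tagged(a, 'a'), tagged(b, 'b'), key=lambda t: t[1])
    for key, group in groupby(merged, key=lambda t: t[1]):
        yield (key, {tag: value for tag, _, value in group})

a = iter([(1, 'a'), (2, 't'), (4, 'c')])
b = iter([(1, 'a'), (3, 'g'), (4, 'g')])
print(list(merge_tagged(a, b)))
# [(1, {'a': 'a', 'b': 'a'}), (2, {'a': 't'}), (3, {'b': 'g'}), (4, {'a': 'c', 'b': 'g'})]
```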
Upvotes: 2
Views: 962
Reputation: 164773
Here is one solution via dictionaries. I provide it here as it's not clear to me that dictionaries are inefficient in this case.
I believe dict_of_lists can be replaced by an iterator, but I use it in the solution below for demonstration purposes.
a = [(1, 'a'), (2, 't'), (4, 'c')]
b = [(1, 'a'), (3, 'g'), (4, 'g')]
dict_of_lists = {'a': a, 'b': b}
def gen_results(dict_of_lists):
    keys = {num for k, v in dict_of_lists.items()
            for num, val in v}
    # Iterate keys in sorted order so the output order is deterministic.
    for key in sorted(keys):
        d = {k: val for k, v in dict_of_lists.items()
             for num, val in v if num == key}
        yield (key, d)
Result
list(gen_results(dict_of_lists))
[(1, {'a': 'a', 'b': 'a'}),
(2, {'a': 't'}),
(3, {'b': 'g'}),
(4, {'a': 'c', 'b': 'g'})]
Upvotes: 2
Reputation: 71461
You can alter your groupby solution by using reduce and a generator in a function:
from itertools import groupby
from functools import reduce
def group_data(a, b):
    sorted_data = sorted(a + b, key=lambda x: x[0])
    data = [reduce(lambda x, y: (*x, y[-1]), list(g))
            for _, g in groupby(sorted_data, key=lambda x: x[0])]
    current = iter(range(len(list(filter(lambda x: len(x) == 2, data)))))
    yield from [i if len(i) == 3
                else (*i, None) if next(current) % 2 == 0
                else (i[0], None, i[-1])
                for i in data]
print(list(group_data([(1, 'a'), (2, 't'), (4, 'c')], [(1, 'a'), (3, 'g'), (4, 'g')])))
Output:
[(1, 'a', 'a'), (2, 't', None), (3, None, 'g'), (4, 'c', 'g')]
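For comparison, here is a lazy variant that produces the same tuple shape without sorting the inputs into memory: each record is tagged with its side via a generator expression and the streams are combined with heapq.merge, so a and b are iterated only once. The slot-assignment layout is my own sketch, and it assumes each record is a two-tuple:

```python
import heapq
from itertools import groupby

def full_outer_join(a, b, key=lambda x: x[0]):
    # Tag each record with its source side (0 = a, 1 = b) so grouped
    # values can be placed in the correct output slot.
    merged = heapq.merge(
        ((key(item), 0, item) for item in a),
        ((key(item), 1, item) for item in b),
        key=lambda t: t[0],
    )
    for k, group in groupby(merged, key=lambda t: t[0]):
        row = [k, None, None]  # missing sides stay None
        for _, side, item in group:
            row[1 + side] = item[1]
        yield tuple(row)

a = iter([(1, 'a'), (2, 't'), (4, 'c')])
b = iter([(1, 'a'), (3, 'g'), (4, 'g')])
print(list(full_outer_join(a, b)))
# [(1, 'a', 'a'), (2, 't', None), (3, None, 'g'), (4, 'c', 'g')]
```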
Upvotes: 2