membranepotential

Reputation: 141

How to do a full outer join / merge of iterators by key?

I have multiple sorted iterators that yield keyed data, representable by lists:

a = iter([(1, 'a'), (2, 't'), (4, 'c')])
b = iter([(1, 'a'), (3, 'g'), (4, 'g')])

I want to merge them, using the key and keeping track of which iterator had a value for a key. This should be equivalent to a full outer join in SQL:

>>> list(full_outer_join(a, b, key=lambda x: x[0]))
[(1, 'a', 'a'), (2, 't', None), (3, None, 'g'), (4, 'c', 'g')]

I tried using heapq.merge and itertools.groupby, but with merge I already lose the information about which iterator each value came from:

>>> list(heapq.merge(a, b, key=lambda x: x[0]))
[(1, 'a'), (1, 'a'), (2, 't'), (3, 'g'), (4, 'c'), (4, 'g')]

So what I could use is a tagging generator:

def tagged(it, tag):
    for item in it:
        yield (tag, *item)

and merge the tagged iterators, group by the key and create a dict using the tag:

from heapq import merge
from itertools import groupby

merged = merge(tagged(a, 'a'), tagged(b, 'b'), key=lambda x: x[1])
grouped = groupby(merged, key=lambda x: x[1])
[(key, {g[0]: g[2] for g in group}) for key, group in grouped]

Which gives me this usable output:

[(1, {'a': 'a', 'b': 'a'}),
 (2, {'a': 't'}),
 (3, {'b': 'g'}),
 (4, {'a': 'c', 'b': 'g'})]

However, I think creating a dict for every group is quite costly performance-wise, so maybe there is a more elegant way? (The sketch after the edits below shows the kind of tuple output I am aiming for.)

Edit:

To clarify, the dataset is too big to fit into memory, so I definitely need to use generators/iterators.

Edit 2:

To further clarify, a and b should only be iterated over once, because they represent huge files that are slow to read.
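
For illustration, here is roughly the tuple-yielding variant I am after, building on the tagged merge above. This is only a sketch: it assumes (key, value) pairs as in the example, and it uses a numeric column tag instead of a string tag so no per-group dict is needed:

from heapq import merge
from itertools import groupby

def full_outer_join(a, b):
    # Tag each pair with its output column (0 = a, 1 = b) so the origin
    # survives the merge; the tuples compare as (key, column, value).
    merged = merge(((k, 0, v) for k, v in a),
                   ((k, 1, v) for k, v in b))
    for key, group in groupby(merged, key=lambda t: t[0]):
        row = [key, None, None]
        for _, col, value in group:
            row[col + 1] = value
        yield tuple(row)

Since heapq.merge is lazy, this consumes a and b exactly once and yields (1, 'a', 'a'), (2, 't', None), (3, None, 'g'), (4, 'c', 'g') for the example data.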

Upvotes: 2

Views: 962

Answers (2)

jpp

Reputation: 164773

Here is one solution via dictionaries. I provide it here as it's not clear to me that dictionaries are inefficient in this case.

I believe dict_of_lists can be replaced by an iterator (see the sketch after the result); I use it here for demonstration purposes.

a = [(1, 'a'), (2, 't'), (4, 'c')]
b = [(1, 'a'), (3, 'g'), (4, 'g')]

dict_of_lists = {'a': a, 'b': b}

def gen_results(dict_of_lists):
    keys = {num for k, v in dict_of_lists.items()
                for num, val in v}
    for key in keys:
        d = {k: val for k, v in dict_of_lists.items()
                    for num, val in v if num == key}
        yield (key, d)

Result

list(gen_results(dict_of_lists))

[(1, {'a': 'a', 'b': 'a'}),
 (2, {'a': 't'}),
 (3, {'b': 'g'}),
 (4, {'a': 'c', 'b': 'g'})]
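
The iterator-based replacement alluded to above could look something like this. It is only a sketch (tag_pairs and gen_results_streaming are illustrative names): it assumes the inputs are sorted by key, as in the question, and merges the tagged streams with heapq.merge so each input is consumed once while still yielding the same (key, dict) pairs:

from heapq import merge
from itertools import groupby

def tag_pairs(name, pairs):
    # A helper binds the tag per iterator; a nested genexp would late-bind it.
    return ((num, name, val) for num, val in pairs)

def gen_results_streaming(named_iters):
    streams = [tag_pairs(name, it) for name, it in named_iters.items()]
    # Tuples merge in (num, name, val) order, so groupby sees each key once.
    for key, group in groupby(merge(*streams), key=lambda t: t[0]):
        yield (key, {name: val for _, name, val in group})

Calling list(gen_results_streaming({'a': iter(a), 'b': iter(b)})) then produces the same result as above, in key order.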

Upvotes: 2

Ajax1234

Reputation: 71461

You can alter your groupby solution by using reduce in a generator function:

from itertools import groupby
from functools import reduce

def group_data(a, b):
    sorted_data = sorted(a + b, key=lambda x: x[0])
    # Collapse each key's pairs into one tuple: (key, value[, value]).
    data = [reduce(lambda x, y: (*x, y[-1]), list(g))
            for _, g in groupby(sorted_data, key=lambda x: x[0])]
    # Length-2 tuples lack one side; alternate which side gets the None.
    current = iter(range(len([t for t in data if len(t) == 2])))
    yield from [t if len(t) == 3
                else (*t, None) if next(current) % 2 == 0
                else (t[0], None, t[-1]) for t in data]

print(list(group_data([(1, 'a'), (2, 't'), (4, 'c')], [(1, 'a'), (3, 'g'), (4, 'g')])))

Output:

[(1, 'a', 'a'), (2, 't', None), (3, None, 'g'), (4, 'c', 'g')]

Upvotes: 2
