nonamorando

Reputation: 1605

Numpy/Pandas: Merge two numpy arrays based on one array efficiently

I have two numpy arrays made up of two-element tuples:

a = [(1, "alpha"), (2, 3), ...]
b = [(1, "zylo"), (1, "xen"), (2, "potato"), ...]

The first element in the tuple is the identifier and shared between both arrays, so I want to create a new numpy array which looks like this:

[(1, "alpha", "zylo", "xen"), (2, 3, "potato"), etc...]

My current solution works, but it is far too slow for my data. It looks like this:

aggregate_collection = []
for tuple_set in a:
  for tuple_set2 in b:
    if tuple_set[0] == tuple_set2[0] and other_condition:
      temp_tup = (tuple_set[0], ...)  # plus the other tuple values
      aggregate_collection.append(temp_tup)

How can I do this efficiently?

Upvotes: 1

Views: 284

Answers (2)

hpaulj

Reputation: 231385

In [278]: a = [(1, "alpha"), (2, 3)]
     ...: b = [(1, "zylo"), (1, "xen"), (2, "potato")]
In [279]: a
Out[279]: [(1, 'alpha'), (2, 3)]
In [280]: b
Out[280]: [(1, 'zylo'), (1, 'xen'), (2, 'potato')]

Note that if I try to make an array from a, I get something quite different: every item is converted to a string.

In [281]: np.array(a)
Out[281]: 
array([['1', 'alpha'],
       ['2', '3']], dtype='<U21')
In [282]: _.shape
Out[282]: (2, 2)

defaultdict is a handy tool for collecting like-keyed values:

In [283]: from collections import defaultdict
In [284]: dd = defaultdict(list)
In [285]: for tup in a+b:
     ...:     k,v = tup
     ...:     dd[k].append(v)
     ...: 
In [286]: dd
Out[286]: defaultdict(list, {1: ['alpha', 'zylo', 'xen'], 2: [3, 'potato']})

which can be cast as a list of tuples with:

In [288]: [(k,*v) for k,v in dd.items()]
Out[288]: [(1, 'alpha', 'zylo', 'xen'), (2, 3, 'potato')]

I'm using a+b to join the lists, since it apparently doesn't matter where the tuples occur.
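As a rough sketch, the same idea can be wrapped in a function. The name merge_by_key is made up here for illustration, and itertools.chain is swapped in for a+b so that no concatenated copy of the lists is built. This is a single pass over all the tuples, so the work grows with len(a) + len(b) rather than len(a) * len(b) as in the nested loop. It assumes every tuple has exactly two items and does not handle the question's unspecified other_condition.

```python
from collections import defaultdict
from itertools import chain


def merge_by_key(*pair_lists):
    """Group the values of (key, value) pairs by key, in first-seen key order.

    Hypothetical helper: assumes every input tuple has exactly two items.
    """
    grouped = defaultdict(list)
    # chain.from_iterable walks the inputs one after another without copying them.
    for key, value in chain.from_iterable(pair_lists):
        grouped[key].append(value)
    # Flatten each key and its collected values into one tuple.
    return [(key, *values) for key, values in grouped.items()]


a = [(1, "alpha"), (2, 3)]
b = [(1, "zylo"), (1, "xen"), (2, "potato")]
merged = merge_by_key(a, b)
print(merged)
```

The result is still a plain list of tuples of differing lengths, not a numpy array.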

Out[288] is an even poorer fit for numpy, since the tuples differ in size, and items (other than the first) might be strings or numbers.

Upvotes: 0

rafaelc

Reputation: 59274

I'd concatenate these into a data frame and just use groupby + agg:

(pd.concat([pd.DataFrame(a), pd.DataFrame(b)])
   .groupby(0)
   .agg(lambda s: [s.name, *s])[1])

where 0 and 1 are the default column names that pd.DataFrame assigns when none are given. Change them to your own column names.

Upvotes: 2
