Chris

Reputation: 1353

Optimizing pandas filter inside apply function

I have a list of pairs, stored in a DataFrame with an 'a' column and a 'b' column. For each pair I want to collect the 'b' values of all the other pairs that share its 'a'. For example, given the following set of pairs:

     a    b
0    c    d
1    e    f
2    c    g
3    e    h 
4    i    j
5    e    k

I would like to end up with:

     a    b    equivalents
0    c    d    [g]
1    e    f    [h, k] 
2    c    g    [d]
3    e    h    [f, k]
4    i    j    []
5    e    k    [f, h]

I can do this with the following:

def equivalents(x):
    # every 'b' that shares this row's 'a', minus the row's own 'b'
    l = pairs[pairs["a"] == x["a"]]["b"].tolist()
    return [y for y in l if y != x["b"]]

pairs["equivalents"] = pairs.apply(equivalents, axis=1)

But it is painfully slow on larger sets (e.g. a million-plus pairs), since it rescans the entire DataFrame once per row. Any suggestions on how I could do this faster?

Upvotes: 2

Views: 307

Answers (1)

JohnE

Reputation: 30404

I think this ought to be a bit faster. First, just add them up.

df['equiv'] = df.groupby('a')['b'].transform(sum)

   a  b equiv
0  c  d    dg
1  e  f   fhk
2  c  g    dg
3  e  h   fhk
4  i  j     j
5  e  k   fhk
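
This works because summing an object (string) column in pandas concatenates the values, so each group's letters end up glued together. A quick sanity check of the same idea:

>>> import pandas as pd
>>> pd.Series(["d", "g"]).sum()
'dg'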

Now split each concatenated string into a list of characters and drop whichever letter is already in column 'b'.

df.apply(lambda x: [y for y in x.equiv if y != x.b], axis=1)

0       [g]
1    [h, k]
2       [d]
3    [f, k]
4        []
5    [f, h]
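
One caveat: the concatenation trick assumes each 'b' is a single character (the string is split back into letters) and that 'b' is unique within each 'a' group. A minimal sketch of the same groupby idea that keeps the 'b' values whole, so it also handles multi-character strings (still assuming 'b' is unique within each group):

import pandas as pd

pairs = pd.DataFrame({"a": ["c", "e", "c", "e", "i", "e"],
                      "b": ["d", "f", "g", "h", "j", "k"]})

# Build each group's full list of 'b' values once, then drop the
# row's own 'b'; no per-row filtering of the whole DataFrame.
groups = pairs.groupby("a")["b"].apply(list)
pairs["equivalents"] = [[y for y in groups[a] if y != b]
                        for a, b in zip(pairs["a"], pairs["b"])]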

Upvotes: 2
