Jeremy
Jeremy

Reputation: 2000

Why is getting the reverse of an index in pandas so slow?

I have a pandas dataframe that I'm using to store network data; it looks like:

from_id, to_id, count
X, Y, 3
Z, Y, 4
Y, X, 2
...

I am trying to add a new column, inverse_count, which gets the count value for the row where the from_id and to_id are reversed from the current row.

I'm taking the following approach. I thought that it would be fast but it is much slower than I anticipated, and I can't figure out why.

def get_inverse_val(x):
    # Takes the inverse of the index for a given row
    # When passed to apply with axis = 1, the index becomes the name
    try:
        return df.loc[(x.name[1], x.name[0]), 'count']
    except KeyError:
        return 0

df = df.set_index(['from_id', 'to_id'])

df['inverse_count'] = df.apply(get_inverse_val, axis = 1)

Upvotes: 0

Views: 173

Answers (2)

Ben
Ben

Reputation: 21625

Why not do a simple merge for this?

df = pd.DataFrame({'from_id': ['X', 'Z', 'Y'], 'to_id': ['Y', 'Y', 'X'], 'count': [3,4,2]})

pd.merge(
  left = df, 
  right = df, 
  how = 'left', 
  left_on = ['from_id', 'to_id'], 
  right_on = ['to_id', 'from_id']
)

  from_id_x to_id_x  count_x from_id_y to_id_y  count_y
0         X       Y        3         Y       X      2.0
1         Z       Y        4       NaN     NaN      NaN
2         Y       X        2         X       Y      3.0

Here we merge from (from, to) -> (to, from) to get reversed matching pairs. In general, you should avoid using apply() as it's slow. (To understand why, realized that it is not a vectorized operation.)

Upvotes: 6

Craig
Craig

Reputation: 4855

You can use .set_index twice to create two dataframes with opposite index orders and assign to create your inverse_count column.

df = (df.set_index(['from_id','to_id'])
        .assign(inverse_count=df.set_index(['to_id','from_id'])['count'])
        .reset_index())

  from_id to_id  count  inverse_count
0       X     Y      3            2.0
1       Z     Y      4            NaN
2       Y     X      2            3.0

Since the question was regarding speed let's look at performance on a larger dataset:

Setup:

import pandas as pd
import string
import itertools

df = pd.DataFrame(list(itertools.permutations(string.ascii_uppercase, 2)), columns=['from_id', 'to_id'])
df['count'] = df.index % 25 + 1

print(df)

    from_id to_id  count
0         A     B      1
1         A     C      2
2         A     D      3
3         A     E      4
4         A     F      5
..      ...   ...    ...
645       Z     U     21
646       Z     V     22
647       Z     W     23
648       Z     X     24
649       Z     Y     25

Set_index:

%timeit (df.set_index(['from_id','to_id'])
           .assign(inverse_count=df.set_index(['to_id','from_id'])['count'])
           .reset_index())
6 ms ± 24.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Merge (from Ben's answer):

%timeit pd.merge(
          left = df,
          right = df,
          how = 'left',
          left_on = ['from_id', 'to_id'],
          right_on = ['to_id', 'from_id'] )
1.73 ms ± 57.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

So, it looks like the merge approach is the faster option.

Upvotes: 2

Related Questions