Reputation: 2000
I have a pandas dataframe that I'm using to store network data; it looks like:
from_id, to_id, count
X, Y, 3
Z, Y, 4
Y, X, 2
...
I am trying to add a new column, inverse_count
, which gets the count
value for the row where the from_id
and to_id
are reversed from the current row.
I'm taking the following approach. I thought that it would be fast but it is much slower than I anticipated, and I can't figure out why.
def get_inverse_val(x):
# Takes the inverse of the index for a given row
# When passed to apply with axis = 1, the index becomes the name
try:
return df.loc[(x.name[1], x.name[0]), 'count']
except KeyError:
return 0
df = df.set_index(['from_id', 'to_id'])
df['inverse_count'] = df.apply(get_inverse_val, axis = 1)
Upvotes: 0
Views: 173
Reputation: 21625
Why not do a simple merge for this?
df = pd.DataFrame({'from_id': ['X', 'Z', 'Y'], 'to_id': ['Y', 'Y', 'X'], 'count': [3,4,2]})
pd.merge(
left = df,
right = df,
how = 'left',
left_on = ['from_id', 'to_id'],
right_on = ['to_id', 'from_id']
)
from_id_x to_id_x count_x from_id_y to_id_y count_y
0 X Y 3 Y X 2.0
1 Z Y 4 NaN NaN NaN
2 Y X 2 X Y 3.0
Here we merge from (from, to) -> (to, from) to get reversed matching pairs. In general, you should avoid using apply()
as it's slow. (To understand why, realized that it is not a vectorized operation.)
Upvotes: 6
Reputation: 4855
You can use .set_index
twice to create two dataframes with opposite index orders and assign to create your inverse_count column.
df = (df.set_index(['from_id','to_id'])
.assign(inverse_count=df.set_index(['to_id','from_id'])['count'])
.reset_index())
from_id to_id count inverse_count
0 X Y 3 2.0
1 Z Y 4 NaN
2 Y X 2 3.0
Since the question was regarding speed let's look at performance on a larger dataset:
Setup:
import pandas as pd
import string
import itertools
df = pd.DataFrame(list(itertools.permutations(string.ascii_uppercase, 2)), columns=['from_id', 'to_id'])
df['count'] = df.index % 25 + 1
print(df)
from_id to_id count
0 A B 1
1 A C 2
2 A D 3
3 A E 4
4 A F 5
.. ... ... ...
645 Z U 21
646 Z V 22
647 Z W 23
648 Z X 24
649 Z Y 25
Set_index:
%timeit (df.set_index(['from_id','to_id'])
.assign(inverse_count=df.set_index(['to_id','from_id'])['count'])
.reset_index())
6 ms ± 24.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Merge (from Ben's answer):
%timeit pd.merge(
left = df,
right = df,
how = 'left',
left_on = ['from_id', 'to_id'],
right_on = ['to_id', 'from_id'] )
1.73 ms ± 57.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
So, it looks like the merge approach is the faster option.
Upvotes: 2