Reputation: 6270
The question is quite similar to this one: Fastest way to merge pandas dataframe on ranges,
but I have several ranges to respect in my join.
I have a dataframe A:
   ip_address  server_port
0          13            2
1           5            2
2          20            4
3          11            4
..        ...          ...
and a dataframe B:
   lowerbound_ip_address  upperbound_ip_address  server_low  server_high    country
0                      0                     10           2            3  Australia
1                     11                     20           2            3      China
2                     11                     20           4            7    Belgium
How can I merge them so that both range conditions (IP address and server port) are respected:
   ip_address  server_port    country
0          13            2      China
1           5            2  Australia
2          20            4    Belgium
3          11            4    Belgium
My original idea was to write a loop, but is there a vectorized solution?
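For reproducibility, the two frames can be built like this (a sketch of the data shown above):
import pandas as pd

A = pd.DataFrame({'ip_address':  [13, 5, 20, 11],
                  'server_port': [2, 2, 4, 4]})

B = pd.DataFrame({'lowerbound_ip_address': [0, 11, 11],
                  'upperbound_ip_address': [10, 20, 20],
                  'server_low':            [2, 2, 4],
                  'server_high':           [3, 3, 7],
                  'country':               ['Australia', 'China', 'Belgium']})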
Upvotes: 0
Views: 180
Reputation: 28709
One option is conditional_join from pyjanitor:
# pip install pyjanitor
import pandas as pd
import janitor

(A
 .conditional_join(
     B,
     ('ip_address', 'lowerbound_ip_address', '>='),
     ('ip_address', 'upperbound_ip_address', '<='),
     ('server_port', 'server_low', '>='),
     ('server_port', 'server_high', '<='))
 .loc[:, ['ip_address', 'server_port', 'country']]
)
   ip_address  server_port    country
0          13            2      China
1           5            2  Australia
2          20            4    Belgium
3          11            4    Belgium
Upvotes: 1
Reputation: 858
Inspired by Fastest way to merge pandas dataframe on ranges, you can use pd.IntervalIndex and build one interval index per range condition (two in this case: one for ip_address and one for server_port; df_a and df_b below are the question's A and B):
ip_intv = pd.IntervalIndex.from_arrays(df_b.lowerbound_ip_address.unique(),
                                       df_b.upperbound_ip_address.unique(),
                                       closed='both')
server_intv = pd.IntervalIndex.from_arrays(df_b.server_low.unique(),
                                           df_b.server_high.unique(),
                                           closed='both')
Then, using pd.cut, find the corresponding interval for each row in both dataframes:
df_a['ip_intv'] = pd.cut(df_a.ip_address, ip_intv)
df_a['server_intv'] = pd.cut(df_a.server_port, server_intv)
df_b['ip_intv'] = pd.cut(df_b.lowerbound_ip_address, ip_intv)
df_b['server_intv'] = pd.cut(df_b.server_low, server_intv)
df_a.set_index(['ip_intv', 'server_intv'], inplace=True)
df_b.set_index(['ip_intv', 'server_intv'], inplace=True)
Finally, join the two frames on the shared interval index:
df_a.join(df_b.country)
Out:
                      ip_address  server_port    country
ip_intv  server_intv
[0, 10]  [2, 3]                5            2  Australia
[11, 20] [2, 3]               13            2      China
         [4, 7]               20            4    Belgium
         [4, 7]               11            4    Belgium
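If you prefer the flat layout asked for in the question, a small follow-up (a sketch, assuming the interval index is no longer needed after the join) could be:
df_a.join(df_b.country).reset_index(drop=True)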
Upvotes: 1
Reputation: 93181
Using numpy broadcasting:
import numpy as np

# range bounds from df_b as flat 1-D arrays
ip_low = df_b['lowerbound_ip_address'].values
ip_high = df_b['upperbound_ip_address'].values
port_low = df_b['server_low'].values
port_high = df_b['server_high'].values

# values from df_a as column vectors, so they broadcast against df_b's rows
ip = df_a['ip_address'].values[:, None]
port = df_a['server_port'].values[:, None]

# mask[i, j] is True when row i of df_a satisfies all four range conditions of row j of df_b
mask = (ip_low <= ip) & (ip <= ip_high) & (port_low <= port) & (port <= port_high)
id_b = np.argmax(mask, axis=1)

df_a.assign(id_b=id_b).join(df_b, on='id_b')
Result:
   ip_address  server_port  id_b  lowerbound_ip_address  upperbound_ip_address  server_low  server_high    country
0          13            2     1                     11                     20           2            3      China
1           5            2     0                      0                     10           2            3  Australia
2          20            4     2                     11                     20           4            7    Belgium
3          11            4     2                     11                     20           4            7    Belgium
It uses array broadcasting to find which row of df_b each row of df_a matches. The result is stored in mask, which looks like this:
array([[False,  True, False],   # row 0 in df_a matches row 1 in df_b
       [ True, False, False],   # row 1 in df_a matches row 0 in df_b
       [False, False,  True],   # row 2 in df_a matches row 2 in df_b
       [False, False,  True]])  # row 3 in df_a matches row 2 in df_b
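To see how broadcasting produces that shape, here is a minimal standalone sketch using the IP values from the question (only the first of the four conditions is shown):
import numpy as np

ip = np.array([13, 5, 20, 11])[:, None]    # shape (4, 1): one entry per row of df_a
ip_low = np.array([0, 11, 11])             # shape (3,):   one entry per row of df_b

ip_low <= ip                               # broadcasts to shape (4, 3): one cell per (df_a row, df_b row) pair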
numpy treats True/False as 1/0, so you can do arithmetic and comparisons on them. For each row we want the index of the first True value, and since True (1) compares greater than False (0), argmax finds exactly that:
id_b = np.argmax(mask, axis=1)
array([1, 0, 2, 2], dtype=int64)
The final line just assigns the new column and joins the two frames together.
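Note that np.argmax returns 0 for a row with no True value, so a row of df_a that matches nothing in df_b would silently be paired with df_b's first row. If that can happen in your data, one possible guard (a sketch) is:
valid = mask.any(axis=1)                                        # rows of df_a with at least one match
df_a.loc[valid].assign(id_b=id_b[valid]).join(df_b, on='id_b')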
Upvotes: 3