PV8
PV8

Reputation: 6270

Way to merge dataframe based on several ranges

the question is quite similiar to this one: Fastest way to merge pandas dataframe on ranges

but I have several ranges to respect in my joining process.

I have a dataframe A:

   ip_address  server_port
0   13         2
1   5          2
2   20         4
3   11         4
.. ........

and a dataframe B:

    lowerbound_ip_address   upperbound_ip_address   server_low  server_high          country
0    0                       10                     2          3                    Australia
1    11                      20                     2          3                   China
2    11                      20                     4          7                   Belgium

How can i merge it that the rules are respected:

   ip_address  server_port   country
0   13         2             China
1   5          2             Australia
2   20         4             Belgium
3   11         4             Belgium

My original idea is to write a loop for it, but is there a vectorized solution for it?

Upvotes: 0

Views: 180

Answers (3)

sammywemmy
sammywemmy

Reputation: 28709

One option is with the conditional_join from pyjanitor.

# pip install pyjanitor
import pandas as pd
import janitor

(A
.conditional_join(
    B, 
    ('ip_address', 'lowerbound_ip_address', '>='), 
    ('ip_address', 'upperbound_ip_address', '<='), 
    ('server_port', 'server_low', '>='), 
    ('server_port', 'server_high', '<='))
.loc[:, ['ip_address', 'server_port', 'country']]
)

   ip_address  server_port    country
0          13            2      China
1           5            2  Australia
2          20            4    Belgium
3          11            4    Belgium

Upvotes: 1

stahamtan
stahamtan

Reputation: 858

Inspired by the Fastest way to merge pandas dataframe on ranges, using pd.IntervalIndex you could create multiple intervals, (two in this case; one for ip_address and one for server_port):

ip_intv = pd.IntervalIndex.from_arrays(df_b.lowerbound_ip_address.unique(), 
                             df_b.upperbound_ip_address.unique(), 
                             'both')

server_intv = pd.IntervalIndex.from_arrays(df_b.server_low.unique(), 
                             df_b.server_high.unique(), 
                             'both')

Then using pd.cut you will find corresponding intervals in both dataframes:

df_a['ip_intv'] = pd.cut(df_a.ip_address, ip_intv)
df_a['server_intv'] = pd.cut(df_a.server_port, server_intv)

df_b['ip_intv'] = pd.cut(df_b.lowerbound_ip_address, ip_intv)
df_b['server_intv'] = pd.cut(df_b.server_low, server_intv)

df_a.set_index(['ip_intv', 'server_intv'], inplace=True)
df_b.set_index(['ip_intv', 'server_intv'], inplace=True)

Finally, you will do join

df_a.join(df_b.country)

Out:

                      ip_address  server_port    country
ip_intv  server_intv                                    
[0, 10]  [2, 3]                5            2  Australia
[11, 20] [2, 3]               13            2      China
         [4, 7]               20            4    Belgium
         [4, 7]               11            4    Belgium

Upvotes: 1

Code Different
Code Different

Reputation: 93181

Using numpy broadcasting:

ip_low = df_b['lowerbound_ip_address'].values
ip_high = df_b['upperbound_ip_address'].values
port_low = df_b['server_low'].values
port_high = df_b['server_high'].values

ip = df_a['ip_address'].values[:, None]
port = df_a['server_port'].values[:, None]

mask = (ip_low <= ip) & (ip <= ip_high) & (port_low <= port) & (port <= port_high)
id_b = np.argmax(mask, axis=1)
df_a.assign(id_b=id_b).join(df_b, on='id_b')

Result:

   ip_address  server_port  id_b  lowerbound_ip_address  upperbound_ip_address  server_low  server_high    country
0          13            2     1                     11                     20           2            3      China
1           5            2     0                      0                     10           2            3  Australia
2          20            4     2                     11                     20           4            7    Belgium
3          11            4     2                     11                     20           4            7    Belgium

How it works

It uses array broadcasting to find where each line in df_a matches to in df_b. The result is stored in mask which looks like this:

array([[False,  True, False],   # line 0 in df_a matches to line 1 in df_b
       [ True, False, False],   #      1                         0
       [False, False,  True],   #      2                         2
       [False, False,  True]])  #      3                         2

numpy treats True/False as 1/0 so you can do comparisons on them. For each row, we want to find the index of the first True value. Since True == 1 > False == 0, we can use argmax to do that:

id_b = np.argmax(mask, axis=1)

array([1, 0, 2, 2], dtype=int64)

The final line is just assigning new column and joining the two frames together.

Upvotes: 3

Related Questions