sastrup

Reputation: 189

Pandas merge dataframes on ip-address by range

I have two dataframes containing some IP information that I'd like to merge (equivalent to a left join in SQL). The dataframes have the following fields:

df1: ["company","ip","actions"]  
df2: ["ip_range_start","ip_range_end","country","state","city"]

The result dataframe should have the headers ["company","ip","actions","country","state","city"]. The problem is the merge criterion: each row of df1 contains a single IP that I'd like to use to pull the country, state, and city information from df2.

This single IP will fall into one of the ranges specified by df2's "ip_range_start" and "ip_range_end" fields. I'm not sure how to accomplish this, as a normal merge/join obviously won't do the trick since there are no matching values between df1 and df2.
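
For concreteness, here's a small illustrative example of the two frames (made-up sample values) and the left-join-by-range result I'm after:

import pandas as pd

df1 = pd.DataFrame({"company": ["comp1", "comp2", "comp3", "comp4"],
                    "ip": ["10.10.1.2", "10.10.2.20", "10.10.3.50", "10.10.4.100"],
                    "actions": ["act1", "act2", "act3", "act4"]})

df2 = pd.DataFrame({"ip_range_start": ["10.10.2.1", "10.10.3.1", "10.10.4.1"],
                    "ip_range_end":   ["10.10.2.254", "10.10.3.254", "10.10.4.254"],
                    "country": ["country2", "country3", "country4"],
                    "state":   ["state2", "state3", "state4"],
                    "city":    ["city2", "city3", "city4"]})

# desired result (left join): comp2/comp3/comp4 pick up country/state/city from the
# range that contains their ip; comp1 gets NaN because 10.10.1.2 matches no range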

My question seems very similar to this one, but different enough to warrant a separate question: Pandas: how to merge two dataframes on offset dates?

Upvotes: 6

Views: 1651

Answers (2)

David Hall

Reputation: 535

If you're willing to use R instead of Python, I've written an R package called ipaddress which can solve this problem.

Using the same data from MaxU's answer:

library(tidyverse)
library(ipaddress)
library(fuzzyjoin)

addr <- tibble(
  company = c("comp1", "comp2", "comp3", "comp4"),
  ip = ip_address(c("10.10.1.2", "10.10.2.20", "10.10.3.50", "10.10.4.100")),
  actions = c("act1", "act2", "act3", "act4")
)
nets <- tibble(
  ip_range_start = ip_address(c("10.10.2.1", "10.10.3.1", "10.10.4.1")),
  ip_range_end = ip_address(c("10.10.2.254", "10.10.3.254", "10.10.4.254")),
  country = c("country2", "country3", "country4"),
  state = c("state2", "state3", "state4"),
  city = c("city2", "city3", "city4")
)

nets <- nets %>%
  mutate(network = common_network(ip_range_start, ip_range_end)) %>%
  select(network, country, state, city)

fuzzy_left_join(addr, nets, c("ip" = "network"), is_within)
#> # A tibble: 4 x 7
#>   company          ip actions      network country  state  city 
#>   <chr>     <ip_addr> <chr>     <ip_netwk> <chr>    <chr>  <chr>
#> 1 comp1     10.10.1.2 act1              NA <NA>     <NA>   <NA> 
#> 2 comp2    10.10.2.20 act2    10.10.2.0/24 country2 state2 city2
#> 3 comp3    10.10.3.50 act3    10.10.3.0/24 country3 state3 city3
#> 4 comp4   10.10.4.100 act4    10.10.4.0/24 country4 state4 city4

Using the same benchmark of 4 million addresses, network membership checks are completed in 3.25 seconds.

big <- tibble(ip = rep(addr$ip, 1e6))
big
#> # A tibble: 4,000,000 x 1
#>             ip
#>      <ip_addr>
#>  1   10.10.1.2
#>  2  10.10.2.20
#>  3  10.10.3.50
#>  4 10.10.4.100
#>  5   10.10.1.2
#>  6  10.10.2.20
#>  7  10.10.3.50
#>  8 10.10.4.100
#>  9   10.10.1.2
#> 10  10.10.2.20
#> # … with 3,999,990 more rows
bench::mark(fuzzy_left_join(big, nets, c("ip" = "network"), is_within))$median
#> [1] 3.25s

Upvotes: 0

MaxU - stand with Ukraine

Reputation: 210882

assume you have the following data frames:

In [5]: df1
Out[5]:
  company           ip actions
0   comp1    10.10.1.2    act1
1   comp2   10.10.2.20    act2
2   comp3   10.10.3.50    act3
3   comp4  10.10.4.100    act4

In [6]: df2
Out[6]:
  ip_range_start ip_range_end   country   state   city
0      10.10.2.1  10.10.2.254  country2  state2  city2
1      10.10.3.1  10.10.3.254  country3  state3  city3
2      10.10.4.1  10.10.4.254  country4  state4  city4

we can create a vectorized function which calculates a numerical IP representation, similar to int(netaddr.IPAddress('192.0.2.1')):

import numpy as np

def ip_to_int(ip_ser):
    # split 'a.b.c.d' into a 2-D array of octets; cast to int64 so the shifts below can't overflow
    ips = ip_ser.str.split('.', expand=True).astype(np.int64).values
    # per-octet bit shifts: a << 24, b << 16, c << 8, d << 0
    mults = np.tile(np.array([24, 16, 8, 0]), len(ip_ser)).reshape(ips.shape)
    return np.sum(np.left_shift(ips, mults), axis=1)
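
as a quick sanity check, the vectorized conversion should agree with the scalar conversion from the stdlib ipaddress module (a small sketch; it assumes ip_to_int is defined as above):

import ipaddress
import pandas as pd

sample = pd.Series(['10.10.2.1', '192.0.2.1'])
assert list(ip_to_int(sample)) == [int(ipaddress.ip_address(s)) for s in sample]
# e.g. '10.10.2.1' -> 168428033, '192.0.2.1' -> 3221225985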

let's convert all IPs to their numerical representations:

df1['_ip'] = ip_to_int(df1.ip)
df2[['_ip_range_start','_ip_range_end']] = df2.filter(like='ip_range').apply(lambda x: ip_to_int(x))

In [10]: df1
Out[10]:
  company           ip actions        _ip
0   comp1    10.10.1.2    act1  168427778
1   comp2   10.10.2.20    act2  168428052
2   comp3   10.10.3.50    act3  168428338
3   comp4  10.10.4.100    act4  168428644

In [11]: df2
Out[11]:
  ip_range_start ip_range_end   country   state   city  _ip_range_start  _ip_range_end
0      10.10.2.1  10.10.2.254  country2  state2  city2        168428033      168428286
1      10.10.3.1  10.10.3.254  country3  state3  city3        168428289      168428542
2      10.10.4.1  10.10.4.254  country4  state4  city4        168428545      168428798

now let's add a new column to the df1 DF, which will contain an index of the first matching IP interval from the df2 DF:

In [12]: df1['x'] = (df1._ip.apply(lambda x: df2.query('_ip_range_start <= @x <= _ip_range_end')
   ....:                                       .index
   ....:                                       .values)
   ....:                   .apply(lambda x: x[0] if len(x) else -1))

In [14]: df1
Out[14]:
  company           ip actions        _ip  x
0   comp1    10.10.1.2    act1  168427778 -1
1   comp2   10.10.2.20    act2  168428052  0
2   comp3   10.10.3.50    act3  168428338  1
3   comp4  10.10.4.100    act4  168428644  2
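
as an aside: since the ranges in df2 are sorted by start and don't overlap, the same x column could also be computed fully vectorized with np.searchsorted instead of the row-wise query (just a sketch, reusing the _ip columns created above):

starts = df2._ip_range_start.values
ends = df2._ip_range_end.values
pos = np.searchsorted(starts, df1._ip.values, side='right') - 1   # last range starting <= ip
inside = (pos >= 0) & (df1._ip.values <= ends[pos.clip(0)])       # ...and ip <= that range's end
df1['x'] = np.where(inside, pos, -1)                              # -1 where no range matches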

finally we can merge both DFs:

In [15]: (pd.merge(df1.drop(columns='_ip'),
   ....:           df2.filter(regex=r'^((?!.?ip_range_).*)$'),
   ....:           left_on='x',
   ....:           right_index=True,
   ....:           how='left')
   ....:    .drop(columns='x')
   ....: )
Out[15]:
  company           ip actions   country   state   city
0   comp1    10.10.1.2    act1       NaN     NaN    NaN
1   comp2   10.10.2.20    act2  country2  state2  city2
2   comp3   10.10.3.50    act3  country3  state3  city3
3   comp4  10.10.4.100    act4  country4  state4  city4

Let's compare the speed of the standard int(IPAddress(...)) conversion to our function (we'll use a 4M-row DF for the comparison):

In [21]: big = pd.concat([df1.ip] * 10**6, ignore_index=True)

In [22]: big.shape
Out[22]: (4000000,)

In [23]: big.head(10)
Out[23]:
0      10.10.1.2
1     10.10.2.20
2     10.10.3.50
3    10.10.4.100
4      10.10.1.2
5     10.10.2.20
6     10.10.3.50
7    10.10.4.100
8      10.10.1.2
9     10.10.2.20
Name: ip, dtype: object

In [24]: %timeit big.apply(lambda x: int(IPAddress(x)))
1 loop, best of 3: 1min 3s per loop

In [25]: %timeit ip_to_int(big)
1 loop, best of 3: 25.4 s per loop

Conclusion: our vectorized function is approx. 2.5 times faster (25.4 s vs. 1 min 3 s on the 4M-row Series).

Upvotes: 4
