Reputation: 549

Python Merge Two Numpy Arrays Based on Condition

How can I merge the following two arrays, by looking up a value from array A in array B?

Array A:

array([['GG', 'AB', IPv4Network('1.2.3.41/26')],
       ['GG', 'AC', IPv4Network('1.2.3.42/25')],
       ['GG', 'AD', IPv4Network('1.2.3.43/24')],
       ['GG', 'AE', IPv4Network('1.2.3.47/23')],
       ['GG', 'AF', IPv4Network('1.2.3.5/24')]],
      dtype=object)

and Array B:

array([['123456', 'A1', IPv4Address('1.2.3.5'), nan],
       ['987654', 'B1', IPv4Address('1.2.3.47'), nan]],
      dtype=object)

The goal here is to create Array C, by looking up the IPv4Address from Array B in Array A and comparing them, and getting the corresponding array's second value and storing it:

Array C:

array([['123456', 'A1', IPv4Address('1.2.3.5'), nan, 'AF'],
       ['987654', 'B1', IPv4Address('1.2.3.47'), nan, 'AE']],
      dtype=object)

The ip addresses are of this type: https://docs.python.org/3/library/ipaddress.html#ipaddress.ip_network

How can I achieve this?

edit:

Please note that the merging is conditioned on the IPs matching, so the resulting array C will have the same number of arrays as the Array B, but it will have one more value. The suggested duplicate links are not answering the same question.

Upvotes: 7

Answers (3)

mrk

Reputation: 10406

This should do what you asked for (at least the output is exactly what you wanted), I made some minor assumptions to deal with your #dummydata, but that should not matter too much.

Code:

import numpy as np
import ipaddress as ip

array_A = np.array([['GG', 'AB', ip.ip_network('192.168.0.0/32')],
                    ['GG', 'AC', ip.ip_network('192.168.0.0/31')],
                    ['GG', 'AD', ip.ip_network('192.168.0.0/30')],
                    ['GG', 'AE', ip.ip_network('192.168.0.0/29')],
                    ['GG', 'AF', ip.ip_network('192.168.0.0/28')]],
                   dtype=object)

array_B = np.array([['123456', 'A1', ip.ip_network('192.168.0.0/28'), np.nan],
                    ['987654', 'B1', ip.ip_network('192.168.0.0/29'), np.nan]],
                   dtype=object)


def merge_by_ip(A, B):
    # initializing an empty array with len(B) rows and 5 columns for the values you want to save in it
    C = np.empty([len(B), 5],dtype=object)
    for n in range(len(B)):
        for a in A:
            # checking condition: if ip address in a is ip address in b
            if a[2] == B[n][2]:
                # add the entry of b with the second value of a to the new Array c
                C[n] = np.append(B[n], a[1])
    return C


print(merge_by_ip(array_A, array_B))

Output:

[['123456' 'A1' IPv4Network('192.168.0.0/28') nan 'AF']
 ['987654' 'B1' IPv4Network('192.168.0.0/29') nan 'AE']]

Note:

This solution has O(m * n) complexity, which isn't necessary, there are many out-of-the box (Pandas) and custom (e.g. using dict) ways to merge with lower complexity.

Upvotes: 6

Grismar

Reputation: 31354

There are issues with your data and complications that prevent you from using join_by or rec_join as the question you linked suggested.

The main issue with your data, as pointed out by others, is that networks like IPv4Network('1.2.3.4/24') are not valid networks, since they have host bits set that are masked out by the /24. The /24 means that the final 32 - 24 = 8 bits are your hosts bits and the constructor for IPv4Network requires that these are set to 0, for example IPv4Network('1.2.3.0/24') is valid.

The main complication is that you have networks in one array, but addresses in the other. Methods like rec_join and join_by use comparison (i.e. ==) to decide what records go together. Some of the other proposed answers 'solve' this by replacing your networks with addresses, but that doesn't appear to be your problem.

Also, note that a single network address may fall into multiple different networks. For example, IPv4Address('1.2.3.129') falls within both IPv4Network('1.2.3.0/24') and IPv4Network('1.2.3.128/25'). So, I assume you'd expect both matches to show up in your results.

To join addresses from one array to networks that the address actually falls within, you'll have to iterate over the array yourself and construct a new one. The type of comparison that works is IPv4Address('1.2.3.129') in IPv4Network('1.2.3.0/24') (this is True).

A working code example bringing this together:

from numpy import nan, asarray, concatenate
from ipaddress import IPv4Address, IPv4Network

a = asarray([
    ['GG', 'AA', IPv4Network('1.2.4.0/24')],
    ['GG', 'AB', IPv4Network('1.2.3.128/25')],
    ['GG', 'AC', IPv4Network('1.2.3.0/24')]
], dtype=object)

b = asarray([
    ['123456', 'A1', IPv4Address('1.2.3.4'), nan],
    ['987654', 'B1', IPv4Address('1.2.3.129'), nan],
    ['024680', 'C1', IPv4Address('1.2.4.0'), nan]
], dtype=object)


def join_addresses_networks(addresses, networks):
    for address in addresses:
        for network in networks:
            if address[2] in network[2]:
                yield concatenate((address, network[:-1]))


c = asarray(list(join_addresses_networks(b, a)))

print(c)

Upvotes: 0

jpp

Reputation: 164773

There appears no reason you can't use Pandas. If your IP addresses align perfectly, you can merge and then use pd.DataFrame.values to return a NumPy array:

import pandas as pd

# data from @mk18
df_A = pd.DataFrame(array_A[:, 1:], columns=['', 'IP'])
df_B = pd.DataFrame(array_B, columns=['id', 'value', 'IP', 'na'])

res = df_B.merge(df_A, on='IP').values

print(res)

array([['123456', 'A1', IPv4Network('192.168.0.0/28'), nan, 'AF'],
       ['987654', 'B1', IPv4Network('192.168.0.0/29'), nan, 'AE']],
      dtype=object)

If you wish to ignore the network component and include only the network_address when merging, i.e. use '1.2.3.5' instead of '1.2.3.5/24', then you can create helper series before merging:

import pandas as pd
from operator import attrgetter

df_A = pd.DataFrame(array_A[:, 1:], columns=['key', 'IP'])
df_B = pd.DataFrame(array_B, columns=['id', 'value', 'IP', 'na'])

df_A['IP_NoNetwork'] = df_A['IP'].map(attrgetter('network_address'))
df_B['IP_NoNetwork'] = df_B['IP'].map(attrgetter('network_address'))

res = df_B.merge(df_A.drop('IP', 1), on='IP_NoNetwork')\
          .loc[:, ['id', 'value', 'IP', 'na', 'key']].values

Upvotes: 3

Python Merge Two Numpy Arrays Based on Condition

edit:

Answers (3)

Related Questions