Huanian Zhang

Reputation: 860

Match two numpy arrays to find the same elements

I have a task that is a bit like an SQL join. One "table" contains the following 1D arrays (about 1 million elements), keyed by ID1:

ID1, z, e, PA, n

Another "table" which contains the following 1D arrays (about 1.5 million elements) identified by ID2:

ID2, RA, DEC

I want to match ID1 against ID2 and keep only the common IDs, forming a third "table" with columns ID, z, e, PA, n, RA, DEC. Most elements of ID1 can be found in ID2, but not all; otherwise I could simply use numpy.in1d(ID1, ID2). Does anyone have a fast way to accomplish this?

For example:

ID1, z, e, PA, n
101, 1.0, 1.2, 1.5, 1.8
104, 1.5, 1.8, 2.2, 3.1
105, 1.4, 2.0, 3.3, 2.8

ID2, RA, DEC
101, 4.5, 10.5
107, 90.1, 55.5
102, 30.5, 3.3
103, 60.1, 40.6
104, 10.8, 5.6

The output should be

ID, z, e, PA, n, RA, DEC
101, 1.0, 1.2, 1.5, 1.8, 4.5, 10.5
104, 1.5, 1.8, 2.2, 3.1, 10.8, 5.6
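
For testing, here is a minimal sketch that writes these two sample tables to disk; the file names id1.txt and id2.txt are assumptions, chosen to match the pandas answer below:

# Write the sample tables from the question to CSV files for testing.
with open('id1.txt', 'w') as f:
    f.write("ID1, z, e, PA, n\n"
            "101, 1.0, 1.2, 1.5, 1.8\n"
            "104, 1.5, 1.8, 2.2, 3.1\n"
            "105, 1.4, 2.0, 3.3, 2.8\n")

with open('id2.txt', 'w') as f:
    f.write("ID2, RA, DEC\n"
            "101, 4.5, 10.5\n"
            "107, 90.1, 55.5\n"
            "102, 30.5, 3.3\n"
            "103, 60.1, 40.6\n"
            "104, 10.8, 5.6\n")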

Upvotes: 3

Views: 2876

Answers (3)

Neapolitan

Reputation: 2163

Use pandas:

import pandas as pd

id1 = pd.read_csv('id1.txt')
id2 = pd.read_csv('id2.txt')
df = id1.merge(id2.sort_values(by='ID2').drop_duplicates('ID2').rename(columns={'ID2':'ID1'}))
print(df)

Produces:

   ID1    z    e   PA    n    RA   DEC
0  101  1.0  1.2  1.5  1.8   4.5  10.5
1  104  1.5  1.8  2.2  3.1  10.8   5.6

With large datasets you may need to do things in place:

# [Optional] sort locations and drop duplicates
id2.sort_values(by='ID2', inplace=True)
id2.drop_duplicates('ID2', inplace=True)

# columns that you are merging must have the same name
id2.rename(columns={'ID2':'ID1'}, inplace=True)

# perform the merge
df = id1.merge(id2)

Without drop_duplicates you get one output row for every matching row in id2, including duplicates:

df = id1.merge(id2.rename(columns={'ID2':'ID1'}))
print(id2)
print(df)

Giving:

   ID2    RA   DEC
0  101   4.5  10.5
1  107  90.1  55.5
2  102  30.5   3.3
3  103  60.1  40.6
4  104  10.8   5.6
5  103  60.1  40.6
6  104  10.9   5.6
   ID1    z    e   PA    n    RA   DEC
0  101  1.0  1.2  1.5  1.8   4.5  10.5
1  104  1.5  1.8  2.2  3.1  10.8   5.6
2  104  1.5  1.8  2.2  3.1  10.9   5.6

Note that this solution preserves each column's dtype (the integer ID column stays integer and the float columns stay float):

>>> id1.ID1.dtype
dtype('int64')
>>> id1[' z'].dtype
dtype('float64')

Since there are spaces after the commas in the header row, those spaces become part of the column names, hence the second column has to be referred to as id1[' z']. Passing skipinitialspace=True to read_csv makes this unnecessary:

>>> id1 = pd.read_csv('id1.txt', skipinitialspace=True)
>>> id1.z.dtype
dtype('float64')
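
Putting the pieces together, a minimal end-to-end sketch of the approach above (skipinitialspace plus the de-duplicate, rename, and merge steps):

import pandas as pd

# Read both tables; skipinitialspace strips the spaces after the commas
# so the column names come out clean (ID1, z, e, ...).
id1 = pd.read_csv('id1.txt', skipinitialspace=True)
id2 = pd.read_csv('id2.txt', skipinitialspace=True)

# Drop duplicate IDs, give the key column the same name in both frames,
# then merge (an inner join on ID1 by default).
id2 = id2.drop_duplicates('ID2').rename(columns={'ID2': 'ID1'})
df = id1.merge(id2)
print(df)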

Upvotes: 1

Neapolitan

Reputation: 2163

Assuming your second table, table B, is sorted by its ID column, you can do a sorted lookup with np.searchsorted, then check whether the indexed element is actually a match:

idx = np.searchsorted(B[:-1, 0], A[:, 0])
found = A[:, 0] == B[idx, 0]
np.hstack((A[found, :], B[idx[found], 1:]))

Result:

array([[ 101. ,    1. ,    1.2,    1.5,    1.8,    4.5,   10.5],
       [ 104. ,    1.5,    1.8,    2.2,    3.1,   10.8,    5.6]])

The last row of B is excluded from the search range to handle the case where an ID in A is beyond the final ID in B. Without this, searchsorted could return an index equal to len(B), which would cause an indexing error; with the clipped range, the equality check against B[idx, 0] still reports such IDs as not found.
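
A self-contained sketch of the same idea on the question's sample data; the np.argsort line is an addition here to cover the case where B is not already sorted by ID:

import numpy as np

# Table A: ID1, z, e, PA, n   /   Table B: ID2, RA, DEC
A = np.array([[101, 1.0, 1.2, 1.5, 1.8],
              [104, 1.5, 1.8, 2.2, 3.1],
              [105, 1.4, 2.0, 3.3, 2.8]])
B = np.array([[101, 4.5, 10.5],
              [107, 90.1, 55.5],
              [102, 30.5, 3.3],
              [103, 60.1, 40.6],
              [104, 10.8, 5.6]])

B = B[np.argsort(B[:, 0])]                 # sort B by its ID column
idx = np.searchsorted(B[:-1, 0], A[:, 0])  # candidate position of each A ID in B
found = A[:, 0] == B[idx, 0]               # keep only genuine matches
out = np.hstack((A[found, :], B[idx[found], 1:]))
print(out)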

Upvotes: 1

Divakar

Reputation: 221674

You can use np.in1d on the first columns of the two arrays/tables, once in each direction, to get two masks for selecting the matching rows from each array. Then, simply stack the results -

mask1 = np.in1d(a[:,0], b[:,0])
mask2 = np.in1d(b[:,0], a[:,0])
out = np.column_stack(( a[mask1], b[mask2,1:] ))

Sample run -

In [44]: a
Out[44]: 
array([[ 101. ,    1. ,    1.2,    1.5,    1.8],
       [ 104. ,    1.5,    1.8,    2.2,    3.1],
       [ 105. ,    1.4,    2. ,    3.3,    2.8]])

In [45]: b
Out[45]: 
array([[ 101. ,    4.5,   10.5],
       [ 102. ,   30.5,    3.3],
       [ 103. ,   60.1,   40.6],
       [ 104. ,   10.8,    5.6],
       [ 107. ,   90.1,   55.5]])

In [46]: mask1 = np.in1d(a[:,0], b[:,0])

In [47]: mask2 = np.in1d(b[:,0], a[:,0])

In [48]: np.column_stack(( a[mask1], b[mask2,1:] ))
Out[48]: 
array([[ 101. ,    1. ,    1.2,    1.5,    1.8,    4.5,   10.5],
       [ 104. ,    1.5,    1.8,    2.2,    3.1,   10.8,    5.6]])
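
Note that the two masks are applied positionally, so this pairs rows correctly as long as the common IDs appear in the same relative order in both arrays (as in the sample run, where both are sorted by ID) and each ID occurs at most once per table. If that ordering is not guaranteed, a sketch that sorts both arrays by ID first:

import numpy as np

# Same sample arrays as in the run above (ID in column 0).
a = np.array([[101, 1.0, 1.2, 1.5, 1.8],
              [104, 1.5, 1.8, 2.2, 3.1],
              [105, 1.4, 2.0, 3.3, 2.8]])
b = np.array([[101, 4.5, 10.5],
              [102, 30.5, 3.3],
              [103, 60.1, 40.6],
              [104, 10.8, 5.6],
              [107, 90.1, 55.5]])

# Sorting both arrays by ID keeps the common IDs in the same relative order,
# so the two masked selections line up row-for-row when stacked.
a = a[np.argsort(a[:, 0])]
b = b[np.argsort(b[:, 0])]

mask1 = np.in1d(a[:, 0], b[:, 0])
mask2 = np.in1d(b[:, 0], a[:, 0])
out = np.column_stack((a[mask1], b[mask2, 1:]))
print(out)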

Upvotes: 1
