Reputation: 860
I have a task kind of like an SQL join. I have one "table" which contains the following 1D arrays (about 1 million elements), identified by ID1:
ID1, z, e, PA, n
Another "table" which contains the following 1D arrays (about 1.5 million elements) identified by ID2
:
ID2, RA, DEC
I want to match ID1 and ID2 to find the common entries and form another "table" containing ID, z, e, PA, n, RA, DEC. Most elements in ID1 can be found in ID2, but not all; otherwise I could simply use numpy.in1d(ID1, ID2) to accomplish it. Does anyone have a fast way to accomplish this task?
For example:
ID1, z, e, PA, n
101, 1.0, 1.2, 1.5, 1.8
104, 1.5, 1.8, 2.2, 3.1
105, 1.4, 2.0, 3.3, 2.8
ID2, RA, DEC
101, 4.5, 10.5
107, 90.1, 55.5
102, 30.5, 3.3
103, 60.1, 40.6
104, 10.8, 5.6
The output should be:
ID, z, e, PA, n, RA, DEC
101, 1.0, 1.2, 1.5, 1.8, 4.5, 10.5
104, 1.5, 1.8, 2.2, 3.1, 10.8, 5.6
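For concreteness, a tiny pure-Python baseline that pins down the intended matching (illustrative only; far too slow for millions of elements, which is why I am asking for a fast way):
rows1 = {101: (1.0, 1.2, 1.5, 1.8), 104: (1.5, 1.8, 2.2, 3.1), 105: (1.4, 2.0, 3.3, 2.8)}
rows2 = {101: (4.5, 10.5), 107: (90.1, 55.5), 102: (30.5, 3.3), 103: (60.1, 40.6), 104: (10.8, 5.6)}
for i in sorted(rows1.keys() & rows2.keys()):  # IDs common to both tables
    print((i,) + rows1[i] + rows2[i])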
Upvotes: 3
Views: 2876
Reputation: 2163
Use pandas:
import pandas as pd
id1 = pd.read_csv('id1.txt')
id2 = pd.read_csv('id2.txt')
# sort id2 by ID, drop duplicate IDs, rename the key column to match id1, then merge
df = id1.merge(id2.sort_values(by='ID2').drop_duplicates('ID2').rename(columns={'ID2':'ID1'}))
print(df)
Produces:
   ID1    z    e   PA    n    RA   DEC
0  101  1.0  1.2  1.5  1.8   4.5  10.5
1  104  1.5  1.8  2.2  3.1  10.8   5.6
With large datasets you may need to do things in place:
# [Optional] sort id2 by ID and drop duplicate IDs
id2.sort_values(by='ID2', inplace=True)
id2.drop_duplicates('ID2', inplace=True)
# the merge key columns must share a name here (see the alternative just below)
id2.rename(columns={'ID2':'ID1'}, inplace=True)
# perform the merge
df = id1.merge(id2)
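Alternatively, merge can join on differently named key columns via its left_on/right_on parameters, which avoids the rename (a minimal sketch; the merged frame keeps both ID columns, so the redundant one is dropped afterwards):
df = id1.merge(id2, left_on='ID1', right_on='ID2').drop(columns='ID2')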
Without drop_duplicates you get one output row for every matching row in id2, so duplicated IDs produce duplicated output rows:
df = id1.merge(id2.rename(columns={'ID2':'ID1'}))
print(id2)
print(df)
Giving:
   ID2    RA   DEC
0  101   4.5  10.5
1  107  90.1  55.5
2  102  30.5   3.3
3  103  60.1  40.6
4  104  10.8   5.6
5  103  60.1  40.6
6  104  10.9   5.6
   ID1    z    e   PA    n    RA   DEC
0  101  1.0  1.2  1.5  1.8   4.5  10.5
1  104  1.5  1.8  2.2  3.1  10.8   5.6
2  104  1.5  1.8  2.2  3.1  10.9   5.6
Note that this solution preserves the different types for the columns:
>>> id1.ID1.dtype
dtype('int64')
>>> id1[' z'].dtype
dtype('float64')
Since you have spaces after the commas in the header row, those spaces become part of the column names, hence the need to refer to the second column as id1[' z']. By passing skipinitialspace=True to the read statement, this is no longer necessary:
>>> id1 = pd.read_csv('id1.txt', skipinitialspace=True)
>>> id1.z.dtype
dtype('float64')
Upvotes: 1
Reputation: 2163
Assuming your second table, table B, is sorted by its first (ID) column, you can do a binary-search lookup with np.searchsorted, then check whether the indexed element was actually found:
import numpy as np

idx = np.searchsorted(B[:-1, 0], A[:, 0])  # insertion points of A's IDs in B's sorted ID column
found = A[:, 0] == B[idx, 0]               # True where the looked-up ID actually matches
np.hstack((A[found, :], B[idx[found], 1:]))
Result:
array([[ 101. ,   1. ,   1.2,   1.5,   1.8,   4.5,  10.5],
       [ 104. ,   1.5,   1.8,   2.2,   3.1,  10.8,   5.6]])
The last element of B's ID column is excluded from the searchsorted call to simplify the case where an item in A is beyond the final element of B. Without this, the returned index could equal the length of B and cause an indexing error.
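For completeness, a self-contained sketch using the sample data from the question (array names A and B are as above; note B must already be sorted by its ID column):
import numpy as np

A = np.array([[101, 1.0, 1.2, 1.5, 1.8],
              [104, 1.5, 1.8, 2.2, 3.1],
              [105, 1.4, 2.0, 3.3, 2.8]])
B = np.array([[101, 4.5, 10.5],     # sorted by the ID column
              [102, 30.5, 3.3],
              [103, 60.1, 40.6],
              [104, 10.8, 5.6],
              [107, 90.1, 55.5]])

idx = np.searchsorted(B[:-1, 0], A[:, 0])
found = A[:, 0] == B[idx, 0]
print(np.hstack((A[found, :], B[idx[found], 1:])))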
Upvotes: 1
Reputation: 221674
Well, you can use np.in1d on the first columns of the two arrays/tables, with the arguments swapped, to get two masks for indexing into each array for selection. Then, simply stack the selected rows -
mask1 = np.in1d(a[:,0], b[:,0])  # rows of a whose ID appears in b
mask2 = np.in1d(b[:,0], a[:,0])  # rows of b whose ID appears in a
out = np.column_stack(( a[mask1], b[mask2,1:] ))
Sample run -
In [44]: a
Out[44]:
array([[ 101. ,   1. ,   1.2,   1.5,   1.8],
       [ 104. ,   1.5,   1.8,   2.2,   3.1],
       [ 105. ,   1.4,   2. ,   3.3,   2.8]])
In [45]: b
Out[45]:
array([[ 101. ,   4.5,  10.5],
       [ 102. ,  30.5,   3.3],
       [ 103. ,  60.1,  40.6],
       [ 104. ,  10.8,   5.6],
       [ 107. ,  90.1,  55.5]])
In [46]: mask1 = np.in1d(a[:,0], b[:,0])
In [47]: mask2 = np.in1d(b[:,0], a[:,0])
In [48]: np.column_stack(( a[mask1], b[mask2,1:] ))
Out[48]:
array([[ 101. ,   1. ,   1.2,   1.5,   1.8,   4.5,  10.5],
       [ 104. ,   1.5,   1.8,   2.2,   3.1,  10.8,   5.6]])
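Note that stacking a[mask1] alongside b[mask2,1:] pairs the rows correctly only when both arrays are sorted by their ID columns and the IDs are unique, as in the sample run above. For unsorted inputs, a minimal sketch of a pre-sort with np.argsort:
a = a[np.argsort(a[:, 0])]  # sort both tables by their ID column first
b = b[np.argsort(b[:, 0])]
mask1 = np.in1d(a[:, 0], b[:, 0])
mask2 = np.in1d(b[:, 0], a[:, 0])
out = np.column_stack((a[mask1], b[mask2, 1:]))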
Upvotes: 1