Reputation: 3521
I have a Pandas Dataframe
D
storing a large database. I also have a smaller DataFrame
C
with 10 rows, containing exactly the same columns as the main one, including column 'price
'. For each row r
in the main dataframe D
I want to find a row t
in C
which is closest, i.e., the difference between r.price
and t.price
is minimal.
As an output I want to have:
I have a function computing the difference
def difference(self, row1, row2):
d = np.abs(row1["price"] - row2["price"])
return d
And I want to use apply function to apply the difference function to each row in C, for each row v in D:
for _, v in D.iterrows():
C.apply(self.difference, axis=1, args=(v,))
But I don't know how I should find the row of C
for which the difference was minimal. I was thinkking about the min
build-in function from Python, but I don't know how to apply it correctly for dataframes.
An example: Let say I have a data D
id | name | price
1. bdb | AAA | 2.34
2. 441 | BBB | 3.56
3. ae9 | CCC | 1.27
4. fb9 | DDD | 9.78
5. e6b | EEE | 5.13
6. bx4 | FFF | 6.23
7. a9a | GGG | 9.56
8. 847 | HHH | 9.23
9. e4c | III | 0.45
...
200. eb3 | XYZ | 10.34
And C (for simplicity with just 5 rows) as below
id | name | price
1. xyh | AA1 | 0.34
2. y5h | BB1 | 9.77
3. af6 | CC1 | 3.24
4. op9 | DD1 | 6.34
5. 23h | EE1 | 0.20
So, the output of my function should be as follows:
Row bdb in D should be matched with row af6
Row 441 in D should be matched with row af6
Row ae9 in D should be matched with row xyh
Row fb9 in D should be matched with row y5h
Row e6b in D should be matched with row op9
etc.
Upvotes: 1
Views: 242
Reputation: 403208
Use np.searchsorted
to index into a sorted version of C.price
.
p1 = D.price.values
v = np.sort(C.price.values)
p2 = v[np.searchsorted(v, p1) - 1]
p2
array([ 0.34, 3.24, 0.34, 9.77, 3.24, 3.24, 6.34, 6.34, 0.34])
Now, subtract them p2
from p1
.
pd.Series(p1 - p2)
0 2.00
1 0.32
2 0.93
3 0.01
4 1.89
5 2.99
6 3.22
7 2.89
8 0.11
dtype: float64
Upvotes: 1