Ziva
Ziva

Reputation: 3521

Python Pandas Dataframe - compute difference between rows and take the minimum one

I have a Pandas Dataframe D storing a large database. I also have a smaller DataFrame C with 10 rows, containing exactly the same columns as the main one, including column 'price'. For each row r in the main dataframe D I want to find a row t in C which is closest, i.e., the difference between r.price and t.price is minimal. As an output I want to have:

I have a function computing the difference

def difference(self, row1, row2):
    d = np.abs(row1["price"] - row2["price"])
    return d

And I want to use apply function to apply the difference function to each row in C, for each row v in D:

for _, v in D.iterrows():
    C.apply(self.difference, axis=1, args=(v,))

But I don't know how I should find the row of C for which the difference was minimal. I was thinkking about the min build-in function from Python, but I don't know how to apply it correctly for dataframes.

An example: Let say I have a data D

    id | name | price
1.  bdb | AAA | 2.34
2.  441 | BBB | 3.56
3.  ae9 | CCC | 1.27
4.  fb9 | DDD | 9.78
5.  e6b | EEE | 5.13
6.  bx4 | FFF | 6.23
7.  a9a | GGG | 9.56
8.  847 | HHH | 9.23
9.  e4c | III | 0.45
...
200. eb3 | XYZ | 10.34

And C (for simplicity with just 5 rows) as below

    id | name | price
1.  xyh | AA1 | 0.34
2.  y5h | BB1 | 9.77
3.  af6 | CC1 | 3.24
4.  op9 | DD1 | 6.34
5.  23h | EE1 | 0.20

So, the output of my function should be as follows:

Row bdb in D should be matched with row af6
Row 441 in D should be matched with row af6
Row ae9 in D should be matched with row xyh 
Row fb9 in D should be matched with row y5h
Row e6b in D should be matched with row op9
etc.

Upvotes: 1

Views: 242

Answers (1)

cs95
cs95

Reputation: 403208

Use np.searchsorted to index into a sorted version of C.price.

p1 = D.price.values

v = np.sort(C.price.values)
p2 = v[np.searchsorted(v, p1) - 1]

p2
array([ 0.34,  3.24,  0.34,  9.77,  3.24,  3.24,  6.34,  6.34,  0.34])

Now, subtract them p2 from p1.

pd.Series(p1 - p2)

0    2.00
1    0.32
2    0.93
3    0.01
4    1.89
5    2.99
6    3.22
7    2.89
8    0.11
dtype: float64

Upvotes: 1

Related Questions