Match values in different data frame and find closest value(s)

Question

I have a dataframe:

4  Amazon        2      x  0.0     2.0     4.0    6.0    8.0
5  Amazon        2      y  0.0     1.0     2.0    3.0    4.0

df2:

     Amazon   2      60
     Netflix  1      100
     Netflix  2      110

I am trying to compare the slope values in the axis column to the corresponding optimal cost values and extract the slope, x and y values that are closest to the optimal cost.

Expected output:

0  Amazon        1      120    2   0.8
1  Amazon        2      57     4   2

Cameron Riddell · Accepted Answer

You can use pd.merge_asof to perform this type of merge quickly. However there is some preprocessing you'll need to do to your data.

reshape df1 to match the format of the expected output (e.g. where "slope", "x", and "y" are columns instead of rows
drop NaNs from the merge keys AND sort both df1 and df2 by their merge keys (this is a requirement of pd.merge_asof that we need to do explicitly). Merge keys are going to be the "slope" and "optimal cost" columns.
Ensure that the merge keys are of the same dtype (in this case they should both be floats, meaning we'll need to convert "optimal cost" to a float type instead of int.
perform the merge operation

# Reshape df1
df1_reshaped = df1.set_index(["Name", "Segment", "Axis"]).unstack(-1).stack(0)

# Drop NaN, sort_values by the merge keys, ensure merge keys are same dtype
df1_reshaped = df1_reshaped.dropna(subset=["slope"]).sort_values("slope")
df2 = df2.sort_values("Optimal Cost").astype({"Optimal Cost": float})

# Perform the merge
out = (
    pd.merge_asof(
        df2, 
        df1_reshaped, 
        left_on="Optimal Cost", 
        right_on="slope", 
        by=["Name", "Segment"],
        direction="nearest"
    ).dropna()
)

print(out)
     Name  Segment  Optimal Cost  slope    x    y
0  Amazon        2          60.0   57.0  4.0  2.0
3  Amazon        1         115.0  120.0  2.0  0.8

And that's it!

If you're curious, here are what df1_reshaped and df2 look like prior to the merge (after the preprocessing).

>>> print(df1_reshaped)
Axis              slope    x    y
Name   Segment
Amazon 2       2   50.0  2.0  1.0
               3   57.0  4.0  2.0
               4   72.0  6.0  3.0
               5   81.0  8.0  4.0
       1       2  100.0  1.0  0.4
               3  120.0  2.0  0.8
               4  127.0  3.0  1.2
               5  140.0  4.0  1.6

>>> print(df2)
      Name  Segment  Optimal Cost
1   Amazon        2          60.0
2  Netflix        1         100.0
3  Netflix        2         110.0
0   Amazon        1         115.0

Match values in different data frame and find closest value(s)

Answers (2)

Related Questions