tj judge
tj judge

Reputation: 616

Match values in different data frame and find closest value(s)

I have a dataframe:

4  Amazon        2      x  0.0     2.0     4.0    6.0    8.0
5  Amazon        2      y  0.0     1.0     2.0    3.0    4.0

df2:

     Amazon   2      60
     Netflix  1      100
     Netflix  2      110

I am trying to compare the slope values in the axis column to the corresponding optimal cost values and extract the slope, x and y values that are closest to the optimal cost.

Expected output:

0  Amazon        1      120    2   0.8
1  Amazon        2      57     4   2
      

Upvotes: 0

Views: 111

Answers (2)

Corralien
Corralien

Reputation: 120509

# Extract data and rearrange index
# Now slope and optim have the same index
slope = df1.loc[df1["Axis"] == "slope"].set_index(["Name", "Segment"]).drop(columns="Axis")
optim = df2.set_index(["Name", "Segment"]).reindex(slope.index)

# Find the closest column to the optimal cost
idx = slope.sub(optim.values).abs().idxmin(axis="columns")
>>> idx
Name    Segment
Amazon  1          3  # column '3' 120 <- optimal: 115
        2          3  # column '3' 57 <- optimal: 60
dtype: object
>>> df1.set_index(["Name", "Segment", "Axis"]) \
       .groupby(["Name", "Segment"], as_index=False) \
       .apply(lambda x: x[idx[x.name]]).unstack() \
       .rename_axis(columns=None).reset_index(["Name", "Segment"])

     Name  Segment  slope    x    y
0  Amazon        1  120.0  2.0  0.8
1  Amazon        2   57.0  4.0  2.0

Upvotes: 0

Cameron Riddell
Cameron Riddell

Reputation: 13417

You can use pd.merge_asof to perform this type of merge quickly. However there is some preprocessing you'll need to do to your data.

  • reshape df1 to match the format of the expected output (e.g. where "slope", "x", and "y" are columns instead of rows
  • drop NaNs from the merge keys AND sort both df1 and df2 by their merge keys (this is a requirement of pd.merge_asof that we need to do explicitly). Merge keys are going to be the "slope" and "optimal cost" columns.
  • Ensure that the merge keys are of the same dtype (in this case they should both be floats, meaning we'll need to convert "optimal cost" to a float type instead of int.
  • perform the merge operation
# Reshape df1
df1_reshaped = df1.set_index(["Name", "Segment", "Axis"]).unstack(-1).stack(0)

# Drop NaN, sort_values by the merge keys, ensure merge keys are same dtype
df1_reshaped = df1_reshaped.dropna(subset=["slope"]).sort_values("slope")
df2 = df2.sort_values("Optimal Cost").astype({"Optimal Cost": float})

# Perform the merge
out = (
    pd.merge_asof(
        df2, 
        df1_reshaped, 
        left_on="Optimal Cost", 
        right_on="slope", 
        by=["Name", "Segment"],
        direction="nearest"
    ).dropna()
)

print(out)
     Name  Segment  Optimal Cost  slope    x    y
0  Amazon        2          60.0   57.0  4.0  2.0
3  Amazon        1         115.0  120.0  2.0  0.8

And that's it!

If you're curious, here are what df1_reshaped and df2 look like prior to the merge (after the preprocessing).

>>> print(df1_reshaped)
Axis              slope    x    y
Name   Segment
Amazon 2       2   50.0  2.0  1.0
               3   57.0  4.0  2.0
               4   72.0  6.0  3.0
               5   81.0  8.0  4.0
       1       2  100.0  1.0  0.4
               3  120.0  2.0  0.8
               4  127.0  3.0  1.2
               5  140.0  4.0  1.6

>>> print(df2)
      Name  Segment  Optimal Cost
1   Amazon        2          60.0
2  Netflix        1         100.0
3  Netflix        2         110.0
0   Amazon        1         115.0

Upvotes: 3

Related Questions