Phil

Reputation: 11

Find matching Pandas dataframe rows with subset of column values within tolerance

I have a CSV file from a time-ordered data capture that represents data values flowing into a system interleaved with data values flowing out. The output is downsampled, and some values suffer a slight degradation in resolution, whereas others undergo unit conversion. But one should be able to unambiguously match an output row with the input row that preceded it. I'm looking for a way to do it without dropping to for loops.

Here's a toy dataset with random numbers. Columns A-E are the input values and Columns F-I are the outputs:

     A            B            C             D             E             F             G            H            I
142  268.4620314  0.88795354   -36.14294272  24.83502189   -5.888309718
143  260.5989964  0.575051317  22.72606608   -6.997301513  26.98494968
144                                                                     -5.344932556  1.699924469  183.7252502  TRUE
145  238.5640776  0.56440478   -25.11744339  28.06150565   15.57436172
146  146.5759824  0.511427295  -34.70449463  -4.724835838  -3.086899544
147  81.38048808  0.818582364  -30.30649328  12.9429191    18.07471308
148  101.7955178  0.964528989  34.94924089   -34.15163516  5.503329208
149  124.8002222  0.834127362  8.467252179   34.6700403    -5.337944916
150                                                                     22.72605133   -6.99729538  260.5988159  TRUE
151  144.1608728  0.598091181  36.13362969   10.90960903   -18.82224584
152  119.9718214  1.003923003  -32.6886618   -29.58910529  -11.96037248

Column I indicates that the row is an output row. Note that row 150 has numbers that match row 143. F:150 matches C:143, G:150 matches D:143, and H:150 matches A:143. What column pairs should match is fixed and known; i.e., column pairs F/C, G/D, and H/A will always match.

There are more columns than shown; this is reduced to demonstrate the concept. Fortunately, if I must drop to for loops, the dataset is only a few thousand rows.

The puzzle is to find the matching input row that precedes each output row, which could be any number of rows prior.

I tried finding the answer to this question on Stack Overflow and elsewhere and came close, but I think my case has enough wrinkles to warrant its own entry. None of the pages I found quite match my use case, because:

  1. All of the rows are in one dataframe.
  2. None of the rows in the dataframe match across all columns.
  3. The value matching involves applying tolerances and/or unit conversions.

I can avoid the columns that require unit conversions; one thing I must do is verify that the conversions are done correctly, so I can't rely on those columns to match rows. For comparison with a tolerance, one page had the hint of using np.isclose. I thought maybe I should split the dataframe so the outputs and inputs are separated, but all the solutions I found seemed to suggest that joining the two is part of the solution; perhaps there is a particular way they should be joined, I just don't know how.

But it seems, given the other things I have to do once the input and output rows are matched, it would be most efficient to have a new dataframe with the output values concatenated on the matching input row (leaving the other rows intact).
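To make the tolerance idea concrete, here is a minimal sketch (values taken from rows 143 and 150 of the toy data above) showing that np.isclose with an absolute tolerance confirms the C/F, D/G, and A/H pairing; the atol value of 0.001 is an assumption, not something from the capture format:

```python
import numpy as np
import pandas as pd

# Input row 143 (columns A, C, D) and output row 150 (columns F, G, H)
inp = pd.Series({"A": 260.5989964, "C": 22.72606608, "D": -6.997301513})
out = pd.Series({"F": 22.72605133, "G": -6.99729538, "H": 260.5988159})

# Compare the mapped column pairs (C/F, D/G, A/H) within an absolute tolerance
match = np.isclose(
    inp[["C", "D", "A"]].to_numpy(),
    out[["F", "G", "H"]].to_numpy(),
    atol=0.001,
).all()
print(match)  # True
```

This is the elementwise building block; the answers below vectorize it over all row pairs.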

Upvotes: 1

Views: 793

Answers (2)

jqurious

Reputation: 21319

It sounds like it could be a .merge_asof problem.

That can be used to find "near" matches - which can then be filtered to find the match with the minimum distance.

import pandas as pd

# split up into inputs/outputs
inputs  = df.loc[ df["I"].isna(), "A":"E"].reset_index()
outputs = df.loc[~df["I"].isna(), "F":"I"].reset_index()

mapping = [("A", "H"), ("C", "F"), ("D", "G")]

# Find near matches for each criteria
# drop_duplicates as rows could match on more than 1
near = pd.concat(
   [
      pd.merge_asof(
         inputs.sort_values(left),
         outputs.sort_values(right),
         left_on=left, 
         right_on=right, 
         direction="nearest",
         # tolerance=0.001,
      )
      for left, right in mapping
   ], 
   ignore_index=True
).drop_duplicates(subset=["index_x", "index_y"])

# Keep only matches where the input row comes before the output row
near = near.loc[near["index_x"] < near["index_y"]].copy()

# Calculate distance for each column pair and total distance
for left, right in mapping:
   near[f"{left}/{right}"] = (near[left] - near[right]).abs()
   
near["distance"] = (
   near[[f"{left}/{right}" for left, right in mapping]].sum(axis=1)
)
    index_x           A         B          C          D          E  index_y          F         G           H     I         A/H        C/F        D/G    distance
6       145  238.564078  0.564405 -25.117443  28.061506  15.574362      150  22.726051 -6.997295  260.598816  True   22.034738  47.843495  35.058801  104.937034
7       143  260.598996  0.575051  22.726066  -6.997302  26.984950      150  22.726051 -6.997295  260.598816  True    0.000180   0.000015   0.000006    0.000201
8       142  268.462031  0.887954 -36.142943  24.835022  -5.888310      150  22.726051 -6.997295  260.598816  True    7.863216  58.868994  31.832317   98.564527
9       142  268.462031  0.887954 -36.142943  24.835022  -5.888310      144  -5.344933  1.699924  183.725250  True   84.736781  30.798010  23.135097  138.669889
16      148  101.795518  0.964529  34.949241 -34.151635   5.503329      150  22.726051 -6.997295  260.598816  True  158.803298  12.223190  27.154340  198.180827
21      146  146.575982  0.511427 -34.704495  -4.724836  -3.086900      150  22.726051 -6.997295  260.598816  True  114.022833  57.430546   2.272460  173.725839

For each output find the row with the minimum distance:

>>> near.loc[near.groupby("index_y")["distance"].idxmin()]
   index_x           A         B          C          D         E  index_y          F         G           H     I        A/H        C/F        D/G    distance
9      142  268.462031  0.887954 -36.142943  24.835022  -5.88831      144  -5.344933  1.699924  183.725250  True  84.736781  30.798010  23.135097  138.669889
7      143  260.598996  0.575051  22.726066  -6.997302  26.98495      150  22.726051 -6.997295  260.598816  True   0.000180   0.000015   0.000006    0.000201

In this case 142 is matched to 144 and 143 is matched to 150.

The tolerance parameter can be used to limit the range of what is considered a match.

pd.merge_asof(
   inputs.sort_values("C"),
   outputs.sort_values("F"),
   left_on="C", 
   right_on="F", 
   direction="nearest",
   tolerance=0.001
)
   index_x           A         B          C          D          E  index_y          F         G           H     I
0      142  268.462031  0.887954 -36.142943  24.835022  -5.888310      NaN        NaN       NaN         NaN   NaN
1      146  146.575982  0.511427 -34.704495  -4.724836  -3.086900      NaN        NaN       NaN         NaN   NaN
2      152  119.971821  1.003923 -32.688662 -29.589105 -11.960372      NaN        NaN       NaN         NaN   NaN
3      147   81.380488  0.818582 -30.306493  12.942919  18.074713      NaN        NaN       NaN         NaN   NaN
4      145  238.564078  0.564405 -25.117443  28.061506  15.574362      NaN        NaN       NaN         NaN   NaN
5      149  124.800222  0.834127   8.467252  34.670040  -5.337945      NaN        NaN       NaN         NaN   NaN
6      143  260.598996  0.575051  22.726066  -6.997302  26.984950    150.0  22.726051 -6.997295  260.598816  True
7      148  101.795518  0.964529  34.949241 -34.151635   5.503329      NaN        NaN       NaN         NaN   NaN
8      151  144.160873  0.598091  36.133630  10.909609 -18.822246      NaN        NaN       NaN         NaN   NaN
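To get the layout the question asks for (output values concatenated onto the matching input row), the matched index pairs can be copied back onto the original frame. A minimal sketch, assuming the matching step above has already produced (input index, output index) pairs; the two-row frame and the `pairs` list here are illustrative:

```python
import numpy as np
import pandas as pd

# Toy frame with one input row (143) and one output row (150)
df = pd.DataFrame(
    {
        "A": [260.5989964, np.nan],
        "C": [22.72606608, np.nan],
        "D": [-6.997301513, np.nan],
        "F": [np.nan, 22.72605133],
        "G": [np.nan, -6.99729538],
        "H": [np.nan, 260.5988159],
        "I": [np.nan, True],
    },
    index=[143, 150],
)

# (input index, output index) pairs from the matching step
pairs = [(143, 150)]

# Copy each matched output's F-H values onto its input row,
# then drop the now-redundant output rows
for in_idx, out_idx in pairs:
    df.loc[in_idx, ["F", "G", "H"]] = df.loc[out_idx, ["F", "G", "H"]].to_numpy()
merged = df.drop(index=[o for _, o in pairs])
```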

Upvotes: 1

mozway

Reputation: 260975

new answer: matching with fixed column mappings

import numpy as np
import pandas as pd

mapper = {'F': 'C', 'G': 'D', 'H': 'A'}
cols = list(mapper)

# extract the input-side columns and convert to a numpy array
out = df[list(mapper.values())].copy()
a1 = out.to_numpy()

# extract the output-side columns and convert to an array
a2 = df[cols].to_numpy()

# broadcast the comparison with an absolute tolerance, keep pairs that
# match on all mapped columns, and use np.tril so that the input row
# always precedes the output row
x, y = np.nonzero(np.tril(np.isclose(a1, a2[:, None], atol=0.0001)
                            .all(axis=-1)))

# this part is optional: deduplicate in case there is more than one match,
# keeping the latest one
idx = len(x) - np.unique(x[::-1], return_index=True)[1] - 1
x = x[idx]
y = y[idx]

# assign the matched output rows next to their input rows
out.loc[out.index[y], cols] = a2[x]

Output:

             C          D           A          F         G           H
142 -36.142943  24.835022  268.462031        NaN       NaN         NaN
143  22.726066  -6.997302  260.598996  22.726051 -6.997295  260.598816
144        NaN        NaN         NaN        NaN       NaN         NaN
145 -25.117443  28.061506  238.564078        NaN       NaN         NaN
146 -34.704495  -4.724836  146.575982        NaN       NaN         NaN
147 -30.306493  12.942919   81.380488        NaN       NaN         NaN
148  34.949241 -34.151635  101.795518        NaN       NaN         NaN
149   8.467252  34.670040  124.800222        NaN       NaN         NaN
150        NaN        NaN         NaN        NaN       NaN         NaN
151  36.133630  10.909609  144.160873        NaN       NaN         NaN
152 -32.688662 -29.589105  119.971821        NaN       NaN         NaN
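The np.tril step is what enforces "input precedes output": in the broadcasted match matrix, the row position comes from the output array and the column position from the input array, so keeping the lower triangle keeps only pairs where the output row is at or after the input row. A small sketch of the shape logic on made-up 2-column data:

```python
import numpy as np

a1 = np.array([[1.0, 2.0], [5.0, 6.0], [1.0, 2.0]])  # "input" rows
a2 = np.array([[9.0, 9.0], [1.0, 2.0], [9.0, 9.0]])  # "output" rows

# (3, 1, 2) vs (3, 2) broadcasts to (3, 3, 2): every a2 row vs every a1 row
hits = np.isclose(a1, a2[:, None], atol=1e-4).all(axis=-1)  # (3, 3) matrix

# a2[1] matches both a1[0] and a1[2]; tril drops the a1[2] pair
# because that "input" sits after the "output" in row order
x, y = np.nonzero(np.tril(hits))
print(x, y)  # [1] [0]
```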

older answer: matching any cell independently of the columns

If you have a reasonable number of rows, one option would be to perform numpy broadcasting, comparing all non-NA cells of the left part with all of the right part and finding combinations that match on all columns of the right part:

cols1 = slice('A', 'E')
cols2 = slice('F', 'H')

# extract left part without NaN and convert to numpy array
out = df.loc[:, cols1].dropna(how='all')
a1 = out.to_numpy()

# extract columns of right part and convert to array
cols = df.loc[:, cols2].columns
a2 = df.loc[:, cols].dropna(how='all').to_numpy()

# find indices of the complete matches of the broadcasted comparison,
# using an absolute tolerance
x, y = np.nonzero(np.isclose(a1, a2[..., None, None], atol=0.0001)
                    .any(axis=-1).all(axis=1)
                  )

# this part is optional
# this is to deduplicate in case there can be more than one match
# this keeps the latest match
idx = len(x)-np.unique(x[::-1], return_index=True)[1]-1
x = x[idx]
y = y[idx]

# assign the matched rows
out.loc[out.index[y], cols] = a2[x]

Output:

              A         B          C          D          E          F         G           H
142  268.462031  0.887954 -36.142943  24.835022  -5.888310        NaN       NaN         NaN
143  260.598996  0.575051  22.726066  -6.997302  26.984950  22.726051 -6.997295  260.598816
145  238.564078  0.564405 -25.117443  28.061506  15.574362        NaN       NaN         NaN
146  146.575982  0.511427 -34.704495  -4.724836  -3.086900        NaN       NaN         NaN
147   81.380488  0.818582 -30.306493  12.942919  18.074713        NaN       NaN         NaN
148  101.795518  0.964529  34.949241 -34.151635   5.503329        NaN       NaN         NaN
149  124.800222  0.834127   8.467252  34.670040  -5.337945        NaN       NaN         NaN
151  144.160873  0.598091  36.133630  10.909609 -18.822246        NaN       NaN         NaN
152  119.971821  1.003923 -32.688662 -29.589105 -11.960372        NaN       NaN         NaN

Example on a different input with duplicates and matches for all rows of the right part:

                A         B          C          D          E          F         G           H
141.0  183.725250  0.987654  -5.344932   1.699925   1.234568  -5.344933  1.699924  183.725250
142.0  268.462031  0.887954 -36.142943  24.835022  -5.888310        NaN       NaN         NaN
143.0  260.598996  0.575051  22.726066  -6.997302  26.984950        NaN       NaN         NaN
143.5  260.598996  0.575051  22.726066  -6.997302  26.984950  22.726051 -6.997295  260.598816
145.0  238.564078  0.564405 -25.117443  28.061506  15.574362        NaN       NaN         NaN
146.0  146.575982  0.511427 -34.704495  -4.724836  -3.086900        NaN       NaN         NaN
147.0   81.380488  0.818582 -30.306493  12.942919  18.074713        NaN       NaN         NaN
148.0  101.795518  0.964529  34.949241 -34.151635   5.503329        NaN       NaN         NaN
149.0  124.800222  0.834127   8.467252  34.670040  -5.337945        NaN       NaN         NaN
151.0  144.160873  0.598091  36.133630  10.909609 -18.822246        NaN       NaN         NaN
152.0  119.971821  1.003923 -32.688662 -29.589105 -11.960372        NaN       NaN         NaN
153.0  184.725250  0.987654  -5.344932   1.699925   1.234568        NaN       NaN         NaN

If you want to restrict the matches to previous values, you need to compare with the original full shape and np.tril; you can drop the NaN rows later:

cols1 = slice('A', 'E')
cols2 = slice('F', 'H')

out = df.loc[:, cols1]
a1 = out.to_numpy()

cols = df.loc[:, cols2].columns
a2 = df.loc[:, cols].to_numpy()

x, y = np.nonzero(np.tril(np.isclose(a1, a2[..., None, None],
                                     atol=0.0001)
                            .any(axis=-1).all(axis=1)
                         ))

idx = len(x)-np.unique(x[::-1], return_index=True)[1]-1
x = x[idx]
y = y[idx]

out.loc[out.index[y], cols] = a2[x]

Output:

                A         B          C          D          E          F         G           H
141.0  183.725250  0.987654  -5.344932   1.699925   1.234568  -5.344933  1.699924  183.725250
142.0  268.462031  0.887954 -36.142943  24.835022  -5.888310        NaN       NaN         NaN
143.0  260.598996  0.575051  22.726066  -6.997302  26.984950        NaN       NaN         NaN
143.5  260.598996  0.575051  22.726066  -6.997302  26.984950  22.726051 -6.997295  260.598816
144.0         NaN       NaN        NaN        NaN        NaN        NaN       NaN         NaN
145.0  238.564078  0.564405 -25.117443  28.061506  15.574362        NaN       NaN         NaN
146.0  146.575982  0.511427 -34.704495  -4.724836  -3.086900        NaN       NaN         NaN
147.0   81.380488  0.818582 -30.306493  12.942919  18.074713        NaN       NaN         NaN
148.0  101.795518  0.964529  34.949241 -34.151635   5.503329        NaN       NaN         NaN
149.0  124.800222  0.834127   8.467252  34.670040  -5.337945        NaN       NaN         NaN
150.0         NaN       NaN        NaN        NaN        NaN        NaN       NaN         NaN
151.0  144.160873  0.598091  36.133630  10.909609 -18.822246        NaN       NaN         NaN
152.0  119.971821  1.003923 -32.688662 -29.589105 -11.960372        NaN       NaN         NaN
153.0  184.725250  0.987654  -5.344932   1.699925   1.234568        NaN       NaN         NaN
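Dropping the leftover all-NaN rows (144 and 150 above) afterwards is a one-liner; a minimal sketch on a two-row illustrative frame:

```python
import numpy as np
import pandas as pd

out = pd.DataFrame(
    {"A": [268.462031, np.nan], "B": [0.887954, np.nan]},
    index=[142, 144],
)

# drop rows where every column is NaN (the former output-only rows)
cleaned = out.dropna(how="all")
print(list(cleaned.index))  # [142]
```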

Upvotes: 1
