Reputation: 11
I have a CSV file from a time-ordered data capture that represents data values flowing into a system interleaved with data values flowing out. The output is downsampled, and some values suffer a slight degradation in resolution, whereas others undergo unit conversion. But one should be able to unambiguously match an output row with the input row that preceded it. I'm looking for a way to do it without dropping to for loops.
Here's a toy dataset with random numbers. Columns A-E are the input values and Columns F-I are the outputs:
| | A | B | C | D | E | F | G | H | I |
|---|---|---|---|---|---|---|---|---|---|
| 142 | 268.4620314 | 0.88795354 | -36.14294272 | 24.83502189 | -5.888309718 | | | | |
| 143 | 260.5989964 | 0.575051317 | 22.72606608 | -6.997301513 | 26.98494968 | | | | |
| 144 | | | | | | -5.344932556 | 1.699924469 | 183.7252502 | TRUE |
| 145 | 238.5640776 | 0.56440478 | -25.11744339 | 28.06150565 | 15.57436172 | | | | |
| 146 | 146.5759824 | 0.511427295 | -34.70449463 | -4.724835838 | -3.086899544 | | | | |
| 147 | 81.38048808 | 0.818582364 | -30.30649328 | 12.9429191 | 18.07471308 | | | | |
| 148 | 101.7955178 | 0.964528989 | 34.94924089 | -34.15163516 | 5.503329208 | | | | |
| 149 | 124.8002222 | 0.834127362 | 8.467252179 | 34.6700403 | -5.337944916 | | | | |
| 150 | | | | | | 22.72605133 | -6.99729538 | 260.5988159 | TRUE |
| 151 | 144.1608728 | 0.598091181 | 36.13362969 | 10.90960903 | -18.82224584 | | | | |
| 152 | 119.9718214 | 1.003923003 | -32.6886618 | -29.58910529 | -11.96037248 | | | | |
Column I indicates that the row is an output row. Note that row 150 has numbers that match row 143. F:150 matches C:143, G:150 matches D:143, and H:150 matches A:143. What column pairs should match is fixed and known; i.e., column pairs F/C, G/D, and H/A will always match.
There are more columns than shown, this is reduced to demonstrate the concept. Fortunately, if I must drop to for loops, the dataset is only a few thousand rows.
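For reference, the toy frame can be reconstructed like this (a sketch; the header name idx for the index column is my own, and column I is left as a True/NaN flag):

import io

import pandas as pd

csv = """idx,A,B,C,D,E,F,G,H,I
142,268.4620314,0.88795354,-36.14294272,24.83502189,-5.888309718,,,,
143,260.5989964,0.575051317,22.72606608,-6.997301513,26.98494968,,,,
144,,,,,,-5.344932556,1.699924469,183.7252502,True
145,238.5640776,0.56440478,-25.11744339,28.06150565,15.57436172,,,,
146,146.5759824,0.511427295,-34.70449463,-4.724835838,-3.086899544,,,,
147,81.38048808,0.818582364,-30.30649328,12.9429191,18.07471308,,,,
148,101.7955178,0.964528989,34.94924089,-34.15163516,5.503329208,,,,
149,124.8002222,0.834127362,8.467252179,34.6700403,-5.337944916,,,,
150,,,,,,22.72605133,-6.99729538,260.5988159,True
151,144.1608728,0.598091181,36.13362969,10.90960903,-18.82224584,,,,
152,119.9718214,1.003923003,-32.6886618,-29.58910529,-11.96037248,,,,
"""
# empty fields become NaN; df["I"] is True on output rows, NaN otherwise
df = pd.read_csv(io.StringIO(csv), index_col="idx")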
The puzzle is to find the matching input row that precedes each output row, which could be any number of rows prior.
I tried finding the answer to this question on Stack Overflow and elsewhere and came close, but I think my case has enough wrinkles to perhaps warrant its own entry. None of the pages I found quite matches my use case, because:
- I can avoid the columns that require unit conversions; one thing I must do is verify the conversions are done correctly, so I can't really rely on those columns to match rows. For comparison with a tolerance, one page had the hint of using np.isclose (a quick check is shown below).
- I thought maybe I should split the dataframe so the outputs and inputs are separated, but all the solutions I found seemed to suggest that joining the two is part of the solution; perhaps there is a particular way they should be joined, I just don't know it.
But given the other things I have to do once the input and output rows are matched, it seems it would be most efficient to have a new dataframe with the output values concatenated onto the matching input row (leaving the other rows intact).
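The np.isclose check mentioned above works like this on the values from rows 143 and 150 (my own quick illustration, with rtol=0 so that atol is the only criterion):

import numpy as np

# C:143 vs F:150 -- within an absolute tolerance of 1e-4
np.isclose(22.72606608, 22.72605133, rtol=0, atol=1e-4)  # True
# ... but not within 1e-6
np.isclose(22.72606608, 22.72605133, rtol=0, atol=1e-6)  # False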
Upvotes: 1
Views: 793
Reputation: 21319
It sounds like it could be a pd.merge_asof problem. That can be used to find "near" matches, which can then be filtered to find the match with the minimum distance.
import pandas as pd

# split up into inputs/outputs
inputs = df.loc[df["I"].isna(), "A":"E"].reset_index()
outputs = df.loc[~df["I"].isna(), "F":"I"].reset_index()
mapping = [("A", "H"), ("C", "F"), ("D", "G")]
# Find near matches for each criteria
# drop_duplicates as rows could match on more than 1
near = pd.concat(
    [
        pd.merge_asof(
            inputs.sort_values(left),
            outputs.sort_values(right),
            left_on=left,
            right_on=right,
            direction="nearest",
            # tolerance=0.001,
        )
        for left, right in mapping
    ],
    ignore_index=True,
).drop_duplicates(subset=["index_x", "index_y"])
# Keep only matches where input comes before output
near = near.loc[near["index_x"] < near["index_y"]]
# Calculate distance for each column pair and total distance
for left, right in mapping:
    near[f"{left}/{right}"] = (near[left] - near[right]).abs()
near["distance"] = (
    near[[f"{left}/{right}" for left, right in mapping]].sum(axis=1)
)
index_x A B C D E index_y F G H I A/H C/F D/G distance
6 145 238.564078 0.564405 -25.117443 28.061506 15.574362 150 22.726051 -6.997295 260.598816 True 22.034738 47.843495 35.058801 104.937034
7 143 260.598996 0.575051 22.726066 -6.997302 26.984950 150 22.726051 -6.997295 260.598816 True 0.000180 0.000015 0.000006 0.000201
8 142 268.462031 0.887954 -36.142943 24.835022 -5.888310 150 22.726051 -6.997295 260.598816 True 7.863216 58.868994 31.832317 98.564527
9 142 268.462031 0.887954 -36.142943 24.835022 -5.888310 144 -5.344933 1.699924 183.725250 True 84.736781 30.798010 23.135097 138.669889
16 148 101.795518 0.964529 34.949241 -34.151635 5.503329 150 22.726051 -6.997295 260.598816 True 158.803298 12.223190 27.154340 198.180827
21 146 146.575982 0.511427 -34.704495 -4.724836 -3.086900 150 22.726051 -6.997295 260.598816 True 114.022833 57.430546 2.272460 173.725839
For each output, find the row with the minimum distance:
>>> near.loc[near.groupby("index_y")["distance"].idxmin()]
index_x A B C D E index_y F G H I A/H C/F D/G distance
9 142 268.462031 0.887954 -36.142943 24.835022 -5.88831 144 -5.344933 1.699924 183.725250 True 84.736781 30.798010 23.135097 138.669889
7 143 260.598996 0.575051 22.726066 -6.997302 26.98495 150 22.726051 -6.997295 260.598816 True 0.000180 0.000015 0.000006 0.000201
In this case 142 is matched to 144 and 143 is matched to 150.
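From there, a sketch of the final step the question asks for, attaching the matched output values to their input rows (my addition, reusing the frames above; unmatched input rows keep NaN in the new columns):

# keep only the winning (minimum-distance) match per output row
best = near.loc[near.groupby("index_y")["distance"].idxmin()]
# index by the matched input row and attach the output values to it
matched = best.set_index("index_x")[["index_y", "F", "G", "H"]]
result = df.loc[df["I"].isna(), "A":"E"].join(matched)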
tolerance can be used to limit the range of what is considered a match.
pd.merge_asof(
    inputs.sort_values("C"),
    outputs.sort_values("F"),
    left_on="C",
    right_on="F",
    direction="nearest",
    tolerance=0.001,
)
index_x A B C D E index_y F G H I
0 142 268.462031 0.887954 -36.142943 24.835022 -5.888310 NaN NaN NaN NaN NaN
1 146 146.575982 0.511427 -34.704495 -4.724836 -3.086900 NaN NaN NaN NaN NaN
2 152 119.971821 1.003923 -32.688662 -29.589105 -11.960372 NaN NaN NaN NaN NaN
3 147 81.380488 0.818582 -30.306493 12.942919 18.074713 NaN NaN NaN NaN NaN
4 145 238.564078 0.564405 -25.117443 28.061506 15.574362 NaN NaN NaN NaN NaN
5 149 124.800222 0.834127 8.467252 34.670040 -5.337945 NaN NaN NaN NaN NaN
6 143 260.598996 0.575051 22.726066 -6.997302 26.984950 150.0 22.726051 -6.997295 260.598816 True
7 148 101.795518 0.964529 34.949241 -34.151635 5.503329 NaN NaN NaN NaN NaN
8 151 144.160873 0.598091 36.133630 10.909609 -18.822246 NaN NaN NaN NaN NaN
Upvotes: 1
Reputation: 260975
Since the column pairs that should match are known, you can compare just those columns: broadcast the comparison with np.isclose, require a match on all pairs, and keep only the lower triangle (np.tril) so the matched input row comes before the output row:

import numpy as np

mapper = {'F': 'C', 'G': 'D', 'H': 'A'}
cols = list(mapper)
# extract left part (in the order it pairs with F/G/H) and convert to numpy array
out = df[list(mapper.values())].copy()
a1 = out.to_numpy()
# extract columns of right part and convert to array
a2 = df[cols].to_numpy()
# find indices of the complete matches of the broadcasted comparison;
# atol defines the tolerance on the difference
x, y = np.nonzero(
    np.tril(np.isclose(a1, a2[:, None], atol=0.0001).all(axis=-1))
)
# this part is optional
# this is to deduplicate in case there can be more than one match
# this keeps the latest match
idx = len(x)-np.unique(x[::-1], return_index=True)[1]-1
x = x[idx]
y = y[idx]
# assign the matched rows
out.loc[out.index[y], cols] = a2[x]
Output:
C D A F G H
142 -36.142943 24.835022 268.462031 NaN NaN NaN
143 22.726066 -6.997302 260.598996 22.726051 -6.997295 260.598816
144 NaN NaN NaN NaN NaN NaN
145 -25.117443 28.061506 238.564078 NaN NaN NaN
146 -34.704495 -4.724836 146.575982 NaN NaN NaN
147 -30.306493 12.942919 81.380488 NaN NaN NaN
148 34.949241 -34.151635 101.795518 NaN NaN NaN
149 8.467252 34.670040 124.800222 NaN NaN NaN
150 NaN NaN NaN NaN NaN NaN
151 36.133630 10.909609 144.160873 NaN NaN NaN
152 -32.688662 -29.589105 119.971821 NaN NaN NaN
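As an aside, the deduplication trick may be easier to see on a small array: np.unique with return_index=True gives the first occurrence of each value, so applying it to the reversed array and flipping the indices yields the last occurrence (an illustration, not part of the answer's solution):

import numpy as np

x = np.array([0, 0, 1, 2, 2, 2])
idx = len(x) - np.unique(x[::-1], return_index=True)[1] - 1
# idx == array([1, 2, 5]): the position of the last 0, 1 and 2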
Alternatively, if the column pairs were not known and you have a reasonable number of rows, one option would be to perform numpy broadcasting, comparing all non-NA cells of the left part with all of the right part and finding the combinations that match on all columns of the right part:
cols1 = slice('A', 'E')
cols2 = slice('F', 'H')
# extract left part without NaN and convert to numpy array
out = df.loc[:, cols1].dropna(how='all')
a1 = out.to_numpy()
# extract columns of right part and convert to array
cols = df.loc[:, cols2].columns
a2 = df.loc[:, cols].dropna(how='all').to_numpy()
# find indices of the complete matches of the broadcasted comparison;
# atol defines the tolerance on the difference
x, y = np.nonzero(
    np.isclose(a1, a2[..., None, None], atol=0.0001).any(axis=-1).all(axis=1)
)
# this part is optional
# this is to deduplicate in case there can be more than one match
# this keeps the latest match
idx = len(x)-np.unique(x[::-1], return_index=True)[1]-1
x = x[idx]
y = y[idx]
# assign the matched rows
out.loc[out.index[y], cols] = a2[x]
Output:
A B C D E F G H
142 268.462031 0.887954 -36.142943 24.835022 -5.888310 NaN NaN NaN
143 260.598996 0.575051 22.726066 -6.997302 26.984950 22.726051 -6.997295 260.598816
145 238.564078 0.564405 -25.117443 28.061506 15.574362 NaN NaN NaN
146 146.575982 0.511427 -34.704495 -4.724836 -3.086900 NaN NaN NaN
147 81.380488 0.818582 -30.306493 12.942919 18.074713 NaN NaN NaN
148 101.795518 0.964529 34.949241 -34.151635 5.503329 NaN NaN NaN
149 124.800222 0.834127 8.467252 34.670040 -5.337945 NaN NaN NaN
151 144.160873 0.598091 36.133630 10.909609 -18.822246 NaN NaN NaN
152 119.971821 1.003923 -32.688662 -29.589105 -11.960372 NaN NaN NaN
Example on a different input with duplicates and matches for all rows of the right part:
A B C D E F G H
141.0 183.725250 0.987654 -5.344932 1.699925 1.234568 -5.344933 1.699924 183.725250
142.0 268.462031 0.887954 -36.142943 24.835022 -5.888310 NaN NaN NaN
143.0 260.598996 0.575051 22.726066 -6.997302 26.984950 NaN NaN NaN
143.5 260.598996 0.575051 22.726066 -6.997302 26.984950 22.726051 -6.997295 260.598816
145.0 238.564078 0.564405 -25.117443 28.061506 15.574362 NaN NaN NaN
146.0 146.575982 0.511427 -34.704495 -4.724836 -3.086900 NaN NaN NaN
147.0 81.380488 0.818582 -30.306493 12.942919 18.074713 NaN NaN NaN
148.0 101.795518 0.964529 34.949241 -34.151635 5.503329 NaN NaN NaN
149.0 124.800222 0.834127 8.467252 34.670040 -5.337945 NaN NaN NaN
151.0 144.160873 0.598091 36.133630 10.909609 -18.822246 NaN NaN NaN
152.0 119.971821 1.003923 -32.688662 -29.589105 -11.960372 NaN NaN NaN
153.0 184.725250 0.987654 -5.344932 1.699925 1.234568 NaN NaN NaN
If you want to restrict matches to previous rows, you need to compare with the original full shape and np.tril; you can drop the NaN rows later:
cols1 = slice('A', 'E')
cols2 = slice('F', 'H')
out = df.loc[:, cols1]
a1 = out.to_numpy()
cols = df.loc[:, cols2].columns
a2 = df.loc[:, cols].to_numpy()
x, y = np.nonzero(np.tril(
    np.isclose(a1, a2[..., None, None], atol=0.0001).any(axis=-1).all(axis=1)
))
idx = len(x)-np.unique(x[::-1], return_index=True)[1]-1
x = x[idx]
y = y[idx]
out.loc[out.index[y], cols] = a2[x]
Output:
A B C D E F G H
141.0 183.725250 0.987654 -5.344932 1.699925 1.234568 -5.344933 1.699924 183.725250
142.0 268.462031 0.887954 -36.142943 24.835022 -5.888310 NaN NaN NaN
143.0 260.598996 0.575051 22.726066 -6.997302 26.984950 NaN NaN NaN
143.5 260.598996 0.575051 22.726066 -6.997302 26.984950 22.726051 -6.997295 260.598816
144.0 NaN NaN NaN NaN NaN NaN NaN NaN
145.0 238.564078 0.564405 -25.117443 28.061506 15.574362 NaN NaN NaN
146.0 146.575982 0.511427 -34.704495 -4.724836 -3.086900 NaN NaN NaN
147.0 81.380488 0.818582 -30.306493 12.942919 18.074713 NaN NaN NaN
148.0 101.795518 0.964529 34.949241 -34.151635 5.503329 NaN NaN NaN
149.0 124.800222 0.834127 8.467252 34.670040 -5.337945 NaN NaN NaN
150.0 NaN NaN NaN NaN NaN NaN NaN NaN
151.0 144.160873 0.598091 36.133630 10.909609 -18.822246 NaN NaN NaN
152.0 119.971821 1.003923 -32.688662 -29.589105 -11.960372 NaN NaN NaN
153.0 184.725250 0.987654 -5.344932 1.699925 1.234568 NaN NaN NaN
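Dropping the all-NaN output rows afterwards is then a one-liner (my addition, following the note above about dropping the NaN later):

# remove the pure output rows, which are all NaN after the assignment
out = out.dropna(how="all")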
Upvotes: 1