Detecting almost duplicate rows

Question

Let's say I have a table that has dates and a value for each date (plus other columns). I can find the rows that have the same value on the same day by using

data.duplicated(subset=["VALUE", "DAY"], keep=False)

Now suppose I want to allow the day to be off by 1 or 2 and the value to be off by up to 10. How do I do that?

Example:

DAY  MTH  YYY   VALUE  NAME
22   9    2016  8.25   John
22   9    2016  43     John
6    11   2016  28.25  Mary
2    10   2016  50     George
23   11   2016  90     George
23   10   2016  30     Jenn
24   8    2016  10     Mike
24   9    2016  10     Mike
24   10   2016  10     Mike
24   11   2016  10     Mike
13   9    2016  170    Kathie
13   10   2016  170    Kathie
13   11   2016  160    Kathie
8    9    2016  16     Gina
9    10   2016  16     Gina
8    11   2016  16     Gina
16   11   2016  25     Ross
21   11   2016  45     Ross
23   9    2016  50     Shari
23   10   2016  50     Shari
23   11   2016  50     Shari

Using the above code I can find:

DAY  MTH  YYY   VALUE  NAME
24   8    2016  10     Mike
24   9    2016  10     Mike
24   10   2016  10     Mike
24   11   2016  10     Mike
23   9    2016  50     Shari
23   10   2016  50     Shari
23   11   2016  50     Shari

However, I would also like to detect the value 16 for Gina on Sep 8, Oct 9, and Nov 8, because the values are the same and the days, though not identical, are only one day apart.

Similarly, I want to detect values on Sep 13, Oct 13, and Nov 13 for Kathie because the value is off just by 10.

How can I do this?

piRSquared · Accepted Answer

Use numpy and upper-triangle indexing to compare every pair of rows:
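To see what triangle indexing produces before it is applied to the table, here is a minimal sketch on 4 rows (the size 4 is just for illustration). `np.triu_indices(n, k=1)` returns two position arrays that together list each unordered pair of rows once, and `k=1` skips the diagonal so no row is paired with itself.

```python
import numpy as np

# Positions (i, j) with i < j for a 4-row table.
i, j = np.triu_indices(4, k=1)
pairs = list(zip(i.tolist(), j.tolist()))
print(pairs)
# [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
```

For n rows this gives n * (n - 1) / 2 pairs, so memory grows quadratically with the table size.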

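import numpy as np

day = df.DAY.values
val = df.VALUE.values

# every pair of row positions (i, j) with i < j
i, j = np.triu_indices(len(df), k=1)
c1 = np.abs(day[i] - day[j]) < 2   # day of month differs by at most 1
c2 = np.abs(val[i] - val[j]) < 10  # strict: a difference of exactly 10 is excluded

# keep each row that takes part in at least one matching pair
c = c1 & c2
df.iloc[np.unique(np.append(i[c], j[c]))]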

    DAY  MTH   YYY  VALUE    NAME
1    22    9  2016   43.0    John
6    24    8  2016   10.0    Mike
7    24    9  2016   10.0    Mike
8    24   10  2016   10.0    Mike
9    24   11  2016   10.0    Mike
10   13    9  2016  170.0  Kathie
11   13   10  2016  170.0  Kathie
13    8    9  2016   16.0    Gina
14    9   10  2016   16.0    Gina
15    8   11  2016   16.0    Gina
17   21   11  2016   45.0    Ross
18   23    9  2016   50.0   Shari
19   23   10  2016   50.0   Shari
20   23   11  2016   50.0   Shari
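Note that the code above compares only DAY and VALUE across the whole table, so rows from different people can match each other (John's 43 on day 22 pairs with Ross's 45 on day 21), and the strict `< 10` leaves out Kathie's 160. A possible variant, sketched below, restricts the comparison to rows with the same NAME and uses inclusive tolerances. The function name `near_duplicates`, its parameters, and the idea of grouping by NAME are assumptions for illustration; the question does not state that matches must be within one person. Like the original, it ignores MTH and does not handle month-end wrap-around (day 31 vs day 1).

```python
import numpy as np
import pandas as pd


def near_duplicates(df, day_tol=1, val_tol=10, by="NAME"):
    """Hypothetical helper: rows with at least one near-duplicate in the same group."""
    keep = np.zeros(len(df), dtype=bool)
    day_all = df["DAY"].to_numpy()
    val_all = df["VALUE"].to_numpy()
    # GroupBy.indices maps each group key to the integer positions of its rows
    for pos in df.groupby(by).indices.values():
        day, val = day_all[pos], val_all[pos]
        i, j = np.triu_indices(len(pos), k=1)  # all pairs inside this group
        close = (np.abs(day[i] - day[j]) <= day_tol) & (np.abs(val[i] - val[j]) <= val_tol)
        keep[pos[i[close]]] = True
        keep[pos[j[close]]] = True
    return df[keep]


# A subset of the question's rows, enough to show the difference.
sample = pd.DataFrame({
    "DAY": [22, 22, 13, 13, 13, 8, 9, 8, 16, 21],
    "VALUE": [8.25, 43, 170, 170, 160, 16, 16, 16, 25, 45],
    "NAME": ["John", "John", "Kathie", "Kathie", "Kathie",
             "Gina", "Gina", "Gina", "Ross", "Ross"],
})
result = near_duplicates(sample)
print(result)
```

On this sample only the three Kathie rows and the three Gina rows are returned; John and Ross no longer match each other. Pass `day_tol=2` to allow the day to be off by 2.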
