J01

Reputation: 145

Find duplicates in dataframe with tolerance in one column instead of exact value

I have a dataframe of expense claims made by staff:

import pandas as pd

data = {'Claim ID': [1, 2, 3, 4, 5, 6, 7],
        'User': ['John', 'John', 'Jake', 'Bob', 'Bob', 'Tom', 'Tom'],
        'Category': ['Meal', 'Meal', 'Stationary', 'Phone Charges', 'Phone Charges', 'Transport', 'Transport'],
        'Amount': [12.00, 13.00, 20.00, 30, 30, 60, 60]}

df = pd.DataFrame(data)

Output:
     Claim ID  User       Category  Amount
            1  John           Meal    12.0
            2  John           Meal    13.0
            3  Jake     Stationary    20.0
            4   Bob  Phone Charges    30.0
            5   Bob  Phone Charges    30.0
            6   Tom      Transport    60.0
            7   Tom      Transport    60.0

I used the following code to find duplicate claims based on User, Category, and Amount, assigning a unique group number to each set of duplicates found:

# Tag each duplicate set with a unique number
conditions = ['User', 'Amount', 'Category']
df['Group'] = df.groupby(conditions).ngroup().add(1)

# Then remove groups with only one row
df = df[df.groupby('Group')['Group'].transform('count') > 1]

Output:
 Claim ID User       Category  Amount  Group
        4  Bob  Phone Charges    30.0      1
        5  Bob  Phone Charges    30.0      1
        6  Tom      Transport    60.0      5
        7  Tom      Transport    60.0      5

Now my question: I want to find duplicates with the same User and Category, but instead of requiring the exact same Amount, I want to allow a tolerance of a few dollars in the amount claimed, say around $1. Using the sample dataframe above, the expected output would be:

 Claim ID  User       Category  Amount  Group
        1  John           Meal    12.0      1
        2  John           Meal    13.0      1
        4   Bob  Phone Charges    30.0      2
        5   Bob  Phone Charges    30.0      2
        6   Tom      Transport    60.0      3
        7   Tom      Transport    60.0      3

Upvotes: 0

Views: 93

Answers (1)

JarroVGIT

Reputation: 5324

I don't know if it is the fastest way, but it does work, and it handles fuzzy conditions like a tolerance well:

import numpy as np

# For each row, build a boolean mask selecting every row with the same User and
# Category whose Amount is within $1; label all matching rows with that row's
# Claim ID (later rows overwrite earlier ones, so each set gets one label).
df['group'] = np.piecewise(
    np.zeros(len(df)),
    [list((df.User.values == user)
          & (df.Category.values == category)
          & (df.Amount.values >= amount - 1)
          & (df.Amount.values <= amount + 1))
     for user, category, amount in zip(df.User.values, df.Category.values, df.Amount.values)],
    df['Claim ID'].values
)

df[df.groupby('group')['group'].transform('count') > 1]

# Result:
   Claim ID  User       Category  Amount  group
0         1  John           Meal    12.0    2.0
1         2  John           Meal    13.0    2.0
3         4   Bob  Phone Charges    30.0    5.0
4         5   Bob  Phone Charges    30.0    5.0
5         6   Tom      Transport    60.0    7.0
6         7   Tom      Transport    60.0    7.0
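Since the approach above compares every row against every other row, a possible alternative (a sketch, not tested against large data) is to sort within each User/Category and start a new group whenever the amount jumps by more than the tolerance. Note the assumption: claims chain together, so amounts 12, 13, 14 would all land in one group even though 12 and 14 differ by more than $1.

```python
import pandas as pd

data = {'Claim ID': [1, 2, 3, 4, 5, 6, 7],
        'User': ['John', 'John', 'Jake', 'Bob', 'Bob', 'Tom', 'Tom'],
        'Category': ['Meal', 'Meal', 'Stationary', 'Phone Charges', 'Phone Charges', 'Transport', 'Transport'],
        'Amount': [12.00, 13.00, 20.00, 30, 30, 60, 60]}
df = pd.DataFrame(data)

# Sort so that near-duplicate amounts within a User/Category are adjacent
df = df.sort_values(['User', 'Category', 'Amount'])

# A new group starts when the User or Category changes, or when the
# amount increases by more than the $1 tolerance
new_group = (
    (df['User'] != df['User'].shift())
    | (df['Category'] != df['Category'].shift())
    | (df['Amount'].diff().abs() > 1)
)
df['Group'] = new_group.cumsum()

# Keep only groups containing more than one claim
result = df[df.groupby('Group')['Group'].transform('count') > 1]
print(result)
```

This runs in O(n log n) because of the sort, whereas the pairwise mask construction is O(n²), which may matter for large claim tables.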

Upvotes: 1
