Ramsey

Reputation: 181

Replacing a value with a previous number if certain conditions are met (pandas)

I have the following dataframe:

ID      SS
ABC    18.40
ABC
ABC    18
LMN    71
LMN    93
LMN    70
XYZ    18.5
XYZ    22
XYZ    

I am trying to create certain levels, and values that are too close together are making things messy. So I want to replace a value that is less than 3% away from an earlier value within its respective "ID" group (the earlier value does not necessarily have to be the immediately preceding one; it can be any value that has already appeared within the group, and the first one in order gets to stay, see 'LMN' below). To make things clear, here's the intended result:

ID      SS
ABC    18.40
ABC
ABC    18.40*
LMN    71
LMN    93
LMN    71*
XYZ    18.5
XYZ    22
XYZ   

I have put * where the changes are made. Also note that XYZ's first value (18.5) stays the same although it is close to ABC's 18.40; that is because I want the 3% condition confined to each ID group.
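
For reference, the input above can be reconstructed like this (a minimal sketch, assuming the blank SS cells are missing values):

import numpy as np
import pandas as pd

# Sample data from the question; blank SS cells become NaN
df = pd.DataFrame({
    'ID': ['ABC', 'ABC', 'ABC', 'LMN', 'LMN', 'LMN', 'XYZ', 'XYZ', 'XYZ'],
    'SS': [18.40, np.nan, 18, 71, 93, 70, 18.5, 22, np.nan],
})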

I have been trying to come up with a viable solution but am unable to find one.

A little help will be dearly appreciated. Thanks!

Upvotes: 1

Views: 112

Answers (1)

Sander van den Oord

Reputation: 12808

The solution comes down to grouping values by ID and comparing each value with the previous one in the group using .diff(). If that difference is too large, the running count from .cumsum() increases by one and a new group number starts, so all values that are close to each other end up with the same group number. Finally, I update values that share the same ID and group number with the value that is seen first (.first()).
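
As a quick illustration of the diff/cumsum grouping idea on a plain, already-sorted Series (toy values, not the question's data):

import pandas as pd

s = pd.Series([70, 71, 93])      # already sorted
jumps = s.diff().gt(3)           # True where the gap to the previous value exceeds 3
print(jumps.tolist())            # [False, False, True]
print(jumps.cumsum().tolist())   # [0, 0, 1] -> 70 and 71 share group 0, 93 starts group 1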

If you want to look at percentage differences, you can do as follows:

# Sort so that values within each ID group are adjacent and ascending
df = df.sort_values(by=['ID', 'SS'])

max_perc_difference = 0.03
# Start a new group where the difference with the previous value
# (relative to the current value) exceeds the threshold
df['group'] = (
    (df.groupby(['ID'])['SS'].diff() / df['SS'])
    .gt(max_perc_difference)
    .cumsum()
)

# Restore the original row order
df = df.sort_index()

# Within each (ID, group) cluster, replace every non-null value with the first one seen
df.loc[df['SS'].notna(), 'SS'] = (df
    .loc[df['SS'].notna()]
    .groupby(['ID', 'group'])['SS']
    .transform('first')
)
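
If you need this in more than one place, the percentage version can be wrapped in a small helper; collapse_close_values and its parameter names below are just illustrative, not part of pandas:

def collapse_close_values(df, id_col='ID', value_col='SS', max_perc_difference=0.03):
    """Within each id_col group, replace values whose relative difference to an
    earlier value is at most max_perc_difference with the first value of that cluster."""
    out = df.sort_values(by=[id_col, value_col])
    # New group where the relative gap to the previous value exceeds the threshold
    out['group'] = (
        (out.groupby(id_col)[value_col].diff() / out[value_col])
        .gt(max_perc_difference)
        .cumsum()
    )
    out = out.sort_index()
    mask = out[value_col].notna()
    out.loc[mask, value_col] = (
        out.loc[mask]
        .groupby([id_col, 'group'])[value_col]
        .transform('first')
    )
    return out.drop(columns='group')

Calling collapse_close_values(df) on the sample data should reproduce the intended result from the question.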

If you want to look at an absolute difference, you can do:

max_diff = 3
# Start a new group where the gap to the previous value within the ID exceeds max_diff
df['group'] = (df
    .sort_values(by=['ID', 'SS'])
    .groupby(['ID'])['SS']
    .diff()
    .gt(max_diff)
    .cumsum()
)

# Within each (ID, group) cluster, replace every non-null value with the first one seen
df.loc[df['SS'].notna(), 'SS'] = (df
    .loc[df['SS'].notna()]
    .groupby(['ID', 'group'])['SS']
    .transform('first')
)

Resulting dataframe:

    ID   SS     group
0   ABC  18.4   0
1   ABC  NaN    0
2   ABC  18.4   0
3   LMN  71.0   0
4   LMN  93.0   1
5   LMN  71.0   0
6   XYZ  18.5   1
7   XYZ  22.0   2
8   XYZ  NaN    2

If you do not want duplicate values in the same group and just want to keep the first, do:

df['SS'] = df['SS'].mask(df.groupby(['ID', 'SS']).cumcount()>0, np.nan)
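
Note that np.nan needs numpy imported. As a quick check of what the mask/cumcount combination does, here is the LMN slice of the result as a toy frame; the second occurrence of 71.0 is blanked out:

import numpy as np
import pandas as pd

d = pd.DataFrame({'ID': ['LMN', 'LMN', 'LMN'], 'SS': [71.0, 93.0, 71.0]})
d['SS'] = d['SS'].mask(d.groupby(['ID', 'SS']).cumcount() > 0, np.nan)
print(d['SS'].tolist())   # [71.0, 93.0, nan]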

See also:
Python pandas - how to group close elements

Upvotes: 1
