I have the following dataframe:
ID SS
ABC 18.40
ABC
ABC 18
LMN 71
LMN 93
LMN 70
XYZ 18.5
XYZ 22
XYZ
I am trying to create certain levels, and values that are too close together make things messy. So I want to replace any value that is within 3% of an earlier value in its "ID" group with that earlier value. (It does not have to be the immediately preceding value; it can be any value that has already appeared within the group. The first in order gets to stay, see 'LMN' below.) To make things clear, here is the intended result:
ID SS
ABC 18.40
ABC
ABC 18.40*
LMN 71
LMN 93
LMN 71*
XYZ 18.5
XYZ 22
XYZ
I have put * where the changes are made. Also note that XYZ's first value (18.5) stays the same although it is close to ABC's 18.40; that is because I want the 3% condition confined within the ID group.
I have been thinking about a viable solution but am unable to find one.
A little help will be dearly appreciated. Thanks!
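For reference, the sample data above can be reconstructed like this (assuming the blank SS cells are missing values, represented as NaN):

```python
import numpy as np
import pandas as pd

# sample data from the question; blank cells assumed to be NaN
df = pd.DataFrame({
    'ID': ['ABC', 'ABC', 'ABC', 'LMN', 'LMN', 'LMN', 'XYZ', 'XYZ', 'XYZ'],
    'SS': [18.40, np.nan, 18, 71, 93, 70, 18.5, 22, np.nan],
})
```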
The solution comes down to grouping values by ID and comparing each value with the previous value in the group using .diff(). If this difference is too large, I add one to a running counter with .cumsum(): a new group number is started. All values that are close to each other get the same group number. Finally, I replace values that share an ID and a group number with the first value seen in that group (.first()).
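A minimal sketch of that group-numbering step on a single, already-sorted series (using 3 as an illustrative absolute threshold):

```python
import pandas as pd

s = pd.Series([18, 18.4, 70, 71, 93])
# diff() compares each value to its predecessor; a gap larger than the
# threshold flips gt() to True, which bumps the cumsum() group counter
group = s.diff().gt(3).cumsum()
print(group.tolist())  # [0, 0, 1, 1, 2]
```

18 and 18.4 share group 0, 70 and 71 share group 1, and 93 gets its own group 2.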
If you want to look at percentage differences, you can do as follows:
df = df.sort_values(by=['ID', 'SS'])
max_perc_difference = 0.03
df['group'] = (
    (df.groupby(['ID'])['SS'].diff() / df['SS'])
    .gt(max_perc_difference)
    .cumsum()
)
df = df.sort_index()
df.loc[df['SS'].notna(), 'SS'] = (df
    .loc[df['SS'].notna()]
    .groupby(['ID', 'group'])['SS']
    .transform('first')
)
If you want to look at an absolute difference, you can do:
max_diff = 3
df['group'] = (df
    .sort_values(by=['ID', 'SS'])
    .groupby(['ID'])['SS']
    .diff()
    .gt(max_diff)
    .cumsum()
)
df.loc[df['SS'].notna(), 'SS'] = (df
    .loc[df['SS'].notna()]
    .groupby(['ID', 'group'])['SS']
    .transform('first')
)
Resulting dataframe:
ID SS group
0 ABC 18.4 0
1 ABC NaN 0
2 ABC 18.4 0
3 LMN 71.0 0
4 LMN 93.0 1
5 LMN 71.0 0
6 XYZ 18.5 1
7 XYZ 22.0 2
8 XYZ NaN 2
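As a self-contained check, here is the absolute-difference variant run end to end on the sample data (blank cells assumed to be NaN):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ID': ['ABC', 'ABC', 'ABC', 'LMN', 'LMN', 'LMN', 'XYZ', 'XYZ', 'XYZ'],
    'SS': [18.40, np.nan, 18, 71, 93, 70, 18.5, 22, np.nan],
})

max_diff = 3
# number runs of close values within each ID: a new group starts whenever
# the gap to the previous sorted value exceeds max_diff
df['group'] = (df
    .sort_values(by=['ID', 'SS'])
    .groupby(['ID'])['SS']
    .diff()
    .gt(max_diff)
    .cumsum()
)
# replace every non-missing value with the first value seen in its (ID, group)
mask = df['SS'].notna()
df.loc[mask, 'SS'] = (df
    .loc[mask]
    .groupby(['ID', 'group'])['SS']
    .transform('first')
)
print(df['SS'].tolist())  # [18.4, nan, 18.4, 71.0, 93.0, 71.0, 18.5, 22.0, nan]
```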
If you do not want duplicate values in the same group and just want to keep the first occurrence, you can do (assuming numpy is imported as np):
df['SS'] = df['SS'].mask(df.groupby(['ID', 'SS']).cumcount() > 0, np.nan)
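A quick illustration of that dedupe step: cumcount() numbers the occurrences of each (ID, SS) pair, so every repeat after the first is masked out.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ID': ['ABC', 'ABC', 'LMN', 'LMN'],
    'SS': [18.4, 18.4, 71.0, 93.0],
})
# cumcount() is 0 for the first occurrence within each (ID, SS) group
# and increases for repeats, which mask() then blanks out
df['SS'] = df['SS'].mask(df.groupby(['ID', 'SS']).cumcount() > 0, np.nan)
print(df['SS'].tolist())  # [18.4, nan, 71.0, 93.0]
```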
See also:
Python pandas - how to group close elements