kappamachine

Reputation: 43

Replace detected outliers with multiple values that sum to the same total

I'm analyzing a data set of R-R intervals, each a fraction of a second, which in total sum to 240 seconds, or 4 minutes.

This data frequently contains outliers, which I can detect and replace using the following code:

RRs:

[1.076,
 0.957,
 0.939,
 0.956,
 0.942,
 0.888,
 0.821,
 0.903,
 0.806,
 0.971,
 0.973,
 0.999,
 0.909,
 0.811,
 0.921,
 0.915,
 0.917,
 0.801,
 0.815,
 0.94,
 0.946,
 0.963,
 0.911,
 0.804,
 1.015,
 0.981,
 0.971,
 0.901,
 0.809,
 1.005,
 0.962,
 0.986,
 0.911,
 0.816,
 0.99,
 0.948,
 0.969,
 0.832,
 0.932,
 0.928,
 0.972,
 0.932,
 0.809,
 0.932,
 0.928,
 0.974,
 0.943,
 0.81,
 0.936,
 0.918,
 0.94,
 0.941,
 0.826,
 0.909,
 0.913,
 0.947,
 0.965,
 0.913,
 0.895,
 0.941,
 0.963,
 0.954,
 0.905,
 0.803,
 0.954,
 0.917,
 0.949,
 0.896,
 0.777,
 0.913,
 0.905,
 0.957,
 0.918,
 0.799,
 0.966,
 0.975,
 1.021,
 0.986,
 0.839,
 0.938,
 0.95,
 0.973,
 0.959,
 0.823,
 0.842,
 0.959,
 0.946,
 0.983,
 0.924,
 0.814,
 1.002,
 0.972,
 0.98,
 0.967,
 0.836,
 0.95,
 0.957,
 1.012,
 0.939,
 0.812,
 0.964,
 0.937,
 0.963,
 0.931,
 0.92,
 0.963,
 0.995,
 0.985,
 0.914,
 0.914,
 0.943,
 0.977,
 0.957,
 0.822,
 0.926,
 0.932,
 0.992,
 1.061,
 0.999,
 0.841,
 0.983,
 0.955,
 0.972,
 0.823,
 0.809,
 0.769,
 0.765,
 0.728,
 0.697,
 0.699,
 0.694,
 0.694,
 0.695,
 0.689,
 0.692,
 0.697,
 0.76,
 0.669,
 0.676,
 0.673,
 0.67,
 0.668,
 0.665,
 0.666,
 0.753,
 0.778,
 8.154,  # <-- outlier
 0.784,
 0.762,
 0.741,
 0.743,
 0.752,
 0.836,
 0.738,
 0.838,
 0.813,
 0.807,
 0.798,
 0.793,
 0.784,
 0.71,
 0.729,
 0.73,
 0.801,
 0.771,
 0.709,
 0.798,
 0.778,
 0.782,
 0.712,
 0.804,
 0.781,
 0.784,
 0.774,
 0.779,
 0.785,
 0.786,
 0.773,
 0.77,
 0.769,
 0.771,
 0.766,
 0.773,
 0.784,
 0.79,
 0.789,
 0.779,
 0.784,
 0.792,
 0.795,
 0.786,
 0.784,
 0.784,
 0.791,
 0.784,
 0.783,
 0.783,
 0.785,
 0.787,
 0.776,
 0.792,
 0.807,
 0.81,
 0.814,
 0.824,
 0.833,
 0.839,
 0.807,
 0.795,
 0.795,
 0.789,
 0.777,
 0.759,
 0.745,
 0.748,
 0.756,
 0.759,
 0.753,
 0.767,
 0.783,
 0.793,
 0.787,
 0.793,
 0.797,
 0.813,
 0.826,
 0.805,
 0.779,
 0.771,
 0.762,
 0.746,
 0.737,
 0.739,
 0.745,
 0.746,
 0.691,
 0.771,
 0.765,
 0.805,
 0.807,
 0.789,
 0.806,
 0.811,
 0.8,
 0.732,
 0.798,
 0.771,
 0.761,
 0.705,
 0.775,
 0.76,
 0.771,
 0.775,
 0.777,
 0.797,
 0.806,
 0.799,
 0.786,
 0.789,
 0.789,
 0.782,
 0.769,
 0.776,
 0.78,
 0.786,
 0.77,
 0.774,
 0.782,
 0.785,
 0.787,
 0.79,
 0.786,
 0.773,
 0.779,
 0.787]
from scipy import stats
import pandas as pd
import numpy as np

df_rrs = pd.DataFrame(RRs, columns=['RRs'])

# flag values more than 1 standard deviation from the mean
mask = np.abs(stats.zscore(df_rrs['RRs'])) > 1

# replace flagged values with a linear interpolation of their neighbours
df_rrs['RRs'] = df_rrs['RRs'].mask(mask).interpolate()
df_rrs = df_rrs['RRs'].to_list()

The code does what it's supposed to: it detects the outlier and replaces it with an interpolated value. The problem is that this leaves a gap in the data, since the cumulative sum should be 4 minutes, and the outlier in the example list of RRs is 8.154 seconds.

So my code above just replaces it with one value, and the data set is effectively shortened, missing about 8 seconds.

So instead of replacing the outlier with a single mean/interpolated value of the neighbouring values, I need to replace it with multiple interpolated values that together sum to 8.154 seconds. Ideally each of these values should be interpolated as well.
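For a single outlier, the idea can be sketched like this (toy data; the rule for choosing the number of replacement beats and the rescaling are my own guesses at something reasonable, not a finished method):

```python
import numpy as np

# Toy illustration: replace one outlier with several interpolated values
# whose sum equals the outlier, preserving the total duration.
rr = np.array([0.78, 0.75, 8.154, 0.76, 0.74])
i = int(np.argmax(rr))                  # index of the outlier (8.154)
outlier = rr[i]

local_mean = np.mean(np.delete(rr, i))  # mean of the non-outlier intervals
n = int(round(outlier / local_mean))    # how many beats fit in the gap

# interpolate n values between the neighbours, then rescale so they
# sum exactly to the outlier's duration
fill = np.linspace(rr[i - 1], rr[i + 1], n + 2)[1:-1]
fill *= outlier / fill.sum()

rr_fixed = np.concatenate([rr[:i], fill, rr[i + 1:]])
print(np.isclose(rr_fixed.sum(), rr.sum()))  # True: total duration preserved
```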

What would be the best approach to do this?

Upvotes: 0

Views: 83

Answers (1)

ouroboros1

Reputation: 14369

Here's one approach:

Minimal reproducible example

from scipy import stats
import pandas as pd
import numpy as np

np.random.seed(0) # for reproducibility

RRs = np.random.randint(40, 60, size=8).astype("float") / 100

desired_sum = 10 # total seconds

# adding 3 outliers at index `2, 3, 6`
RRs[[2, 3, 6]] += (desired_sum - RRs.sum())/3

df_rrs = pd.DataFrame(RRs, columns=['RRs'])

print(df_rrs['RRs'].sum())
# 10.0
    RRs
0  0.52
1  0.55
2  2.44 # outlier (consecutive: to be grouped)
3  2.47 # outlier (consecutive: to be grouped)
4  0.43
5  0.47
6  2.53 # outlier
7  0.59

Code

# mask for outliers
mask = (np.abs(stats.zscore(df_rrs['RRs'])) > 1)

# store outliers
outliers = df_rrs.loc[mask, 'RRs']

# group consecutive outliers
group_outliers = (outliers.index.to_series().diff() != 1).cumsum()

# grouper values as first index value per outlier group
grouper = group_outliers.index.to_series().mask(group_outliers.duplicated()).ffill()

# get sum per group
outliers_grouped = outliers.groupby(grouper).sum()

# determine rows needed per group
rows = np.round(outliers_grouped / df_rrs.loc[~mask, 'RRs'].mean())

# isolate index values we no longer want
outliers_excluded = outliers.index.difference(outliers_grouped.index)

# get index `df_rrs` without index values we no longer want
reindex = df_rrs.index.difference(outliers_excluded)

# reindex with `np.repeat` to get appropriate repeats for index values in `rows`
df_rrs = df_rrs.reindex(
    np.repeat(reindex, rows.reindex(reindex, fill_value=1))
    )

# interpolate
df_rrs['RRs'] = df_rrs['RRs'].mask(mask).interpolate()

# scale interpolated values
df_rrs.loc[mask, 'RRs'] = (
    df_rrs.loc[mask, 'RRs'] * (outliers_grouped 
                               / df_rrs.loc[mask, 'RRs'].groupby(level=0).sum())
    )

Output:

        RRs
0  0.520000
1  0.550000
2  0.540191 # index value 2, outlier interpolated (grouped)
2  0.529260
2  0.518328
2  0.507397
2  0.496466
2  0.485534
2  0.474603
2  0.463672
2  0.452740
2  0.441809
4  0.430000
5  0.470000
6  0.467811 # index value 6, outlier interpolated
6  0.486906
6  0.506000
6  0.525094
6  0.544189
7  0.590000

Sum check with np.isclose:

np.isclose(df_rrs['RRs'].sum(), desired_sum)
# True

Note that inherent floating-point precision issues may cause a slight difference between the sum and desired_sum, so df_rrs['RRs'].sum() == desired_sum may not always be True (here it happens to be). That seems unavoidable.


Explanation / intermediates

  • Create your mask and store the outlier subset as outliers.
outliers

2    2.44
3    2.47
6    2.53
Name: RRs, dtype: float64
grouper

2    2.0 # one group
3    2.0 # one group
6    6.0
dtype: float64
outliers.groupby(grouper).sum()

2.0    4.91
6.0    2.53
Name: RRs, dtype: float64
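The consecutive-grouping trick in isolation (a minimal sketch reusing the same toy index values):

```python
import pandas as pd

# Adjacent index values (2 and 3) get the same group label; the jump to 6
# starts a new group, because the index difference there is not 1.
s = pd.Series([2.44, 2.47, 2.53], index=[2, 3, 6])
groups = (s.index.to_series().diff() != 1).cumsum()
print(groups.tolist())                            # [1, 1, 2]
print(s.groupby(groups).sum().round(2).tolist())  # [4.91, 2.53]
```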
  • Now, we need to find out how many rows we need per group: np.round applied to outliers_grouped divided by the mean of the correct values.
rows

2.0    10.0
6.0     5.0
Name: RRs, dtype: float64
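In other words, each group's total is divided by the typical clean beat length and rounded to a whole number of beats. With the toy values above (group sums 4.91 and 2.53, mean clean interval ≈ 0.512):

```python
import numpy as np
import pandas as pd

# group totals indexed by the first index value of each group
outliers_grouped = pd.Series([4.91, 2.53], index=[2.0, 6.0])

# divide by the mean of the non-outlier intervals, round to whole beats
rows = np.round(outliers_grouped / 0.512)
print(rows.tolist())  # [10.0, 5.0]
```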
  • We need to reindex using df.reindex. First isolate the index values that were absorbed into a group during the grouping process (and should be dropped): outliers.index.difference(outliers_grouped.index).
outliers.index.difference(outliers_grouped.index)

Index([3], dtype='int64')
  • Next, get the difference between df_rrs.index and outliers_excluded, excluding in this case 3.
  • Now we can reindex using np.repeat to get 10 and 5 rows for our rows index and 1 for the others (fill_value=1).
df_rrs.index # reindexed

Index([0, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 4, 5, 6, 6, 6, 6, 6, 7], dtype='int64')
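The repeat-reindex step on its own, with hypothetical small numbers:

```python
import numpy as np
import pandas as pd

# Repeating index label 2 three times duplicates that row; reindex accepts
# the duplicated target labels because the source index is unique.
df = pd.DataFrame({'x': [10.0, 20.0, 30.0]}, index=[0, 2, 4])
repeats = pd.Series([1, 3, 1], index=[0, 2, 4])
df2 = df.reindex(np.repeat(df.index, repeats))
print(df2.index.tolist())  # [0, 2, 2, 2, 4]
```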
  • At this stage, we can finally use Series.interpolate.
  • The last step is to scale the interpolated values to reach desired_sum. For this we need another df.groupby on the interpolated values to get the sum per group as input for: interpolated values * (outliers_grouped / sum grouped interpolated values).
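The scaling step alone, with a made-up interpolated group. Pandas aligns the unique-index group sums onto the duplicated index labels, so each interpolated value is multiplied by its group's correction factor:

```python
import numpy as np
import pandas as pd

# Four interpolated values for one outlier group (index label 2); scale them
# so the group sums to the original outlier total of 4.91.
filled = pd.Series([0.54, 0.52, 0.50, 0.48], index=[2, 2, 2, 2])
target = pd.Series([4.91], index=[2])
scaled = filled * (target / filled.groupby(level=0).sum())
print(bool(np.isclose(scaled.sum(), 4.91)))  # True
```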

Upvotes: 1
