nivk
nivk

Reputation: 685

Conditionally replace values in pandas.DataFrame with previous value

I need to filter outliers in a dataset. Replacing the outlier with the previous value in the column makes the most sense in my application.

I was having considerable difficulty doing this with the pandas tools available (mostly to do with copies on slices, or type conversions occurring when setting to NaN).

Is there a fast and/or memory efficient way to do this? (Please see my answer below for the solution I am currently using, which also has limitations.)


A simple example:

>>> import pandas as pd
>>> df = pd.DataFrame({'A':[1,2,3,4,1000,6,7,8],'B':list('abcdefgh')})
>>> df
      A  B
0     1  a
1     2  b
2     3  c
3     4  d
4  1000  e # '1000  e' --> '4  e'
5     6  f
6     7  g
7     8  h

Upvotes: 2

Views: 1635

Answers (2)

user3483203
user3483203

Reputation: 51165

You can simply mask values over your threshold and use ffill:

df.assign(A=df.A.mask(df.A.gt(10)).ffill())

     A  B
0  1.0  a
1  2.0  b
2  3.0  c
3  4.0  d
4  4.0  e
5  6.0  f
6  7.0  g
7  8.0  h

Using mask is necessary rather than something like shift, because it guarantees non-outlier output in the case that the previous value is also above a threshold.

Upvotes: 2

nivk
nivk

Reputation: 685

I circumvented some of the issues with pandas copies and slices by converting to a numpy array first, doing the operations there, and then re-inserting the column. I'm not certain, but as far as I can tell, the datatype is the same once it is put back into the pandas.DataFrame.

def df_replace_with_previous(df,col,maskfunc,inplace=False):
    arr = np.array(df[col])
    mask = maskfunc(arr)
    arr[ mask ] = arr[ list(mask)[1:]+[False] ]
    if inplace:
        df[col] = arr
        return
    else:
        df2 = df.copy()
        df2[col] = arr
        return df2

This creates a mask, shifts it down by one so that the True values point at the previous entry, and updates the array. Of course, this will need to run recursively if there are multiple adjacent outliers (N times if there are N consecutive outliers), which is not ideal.

Usage in the case given in OP:

df_replace_with_previous(df,'A',lambda x:x>10,False)

Upvotes: 0

Related Questions