gibbz00
gibbz00

Reputation: 1987

Ignoring multiple NaNs when calculating standard deviation

I have the following pandas dataframe which contains some 5 minute intraday data. DeltaBetweenClose is NaN for the first trade bar in the day when market opens(9:30 EST).

    time     Date       symbol    DeltaBetweenClose 
    9:35    2017-07-17  spy        NaN 
    9:40    2017-07-17  spy       -1.2                     
    ..........................................
    ..........................................    
    16:00   2018-07-17  spy        1.7
    9:35    2017-07-18  spy        NaN
    9:40    2017-07-18  spy        0.3                      
    ..........................................
    ..........................................        
    9:35    2018-07-17  nflx       NaN

I am trying to create a column CloseDelta_sd that calculates a rolling standard deviation of DeltaBetweenClose column grouped by symbols that looks into the prior 30 bars and calculates standard deviation while ignoring NaNs. My following attempt returns all NaNs. It works when there is only one NaN at the top of the DeltaBetweenClose column.

df['CloseDelta_sd'] = df.groupby('symbol').DeltaBetweenClose.transform(lambda x: x.rolling(30).std())

Upvotes: 4

Views: 2751

Answers (1)

user3483203
user3483203

Reputation: 51165

The issue isn't with std, since that skips NaN by default, but instead with rolling.

You need to make use of the min_periods parameter:

Minimum number of observations in window required to have a value (otherwise result is NA). For a window that is specified by an offset, this will default to 1.

Since you provide an int instead of an offset, you will end up with a lot of NaN groups, the fix is simple:

(df.groupby('symbol').DeltaBetweenClose
    .transform(lambda x: x.rolling(30, min_periods=1).std()))

Upvotes: 3

Related Questions