freak11
freak11

Reputation: 391

How to resample ohlc data properly in pandas / custom fill method per column

I have got OHLC data with missing time frames. Suppose I have the following pandas dataframe denoted by the variable df:

                     Open     High     Low      Close
2019-04-19 00:00:00  0.67068  0.67123  0.67064  0.67123
2019-04-19 00:02:00  0.67062  0.67425  0.67060  0.67223

Now, I resample that pandas dataframe to fill the missing gap and I get the following:

df = df.resample('T').ffill()

                     Open     High     Low      Close
2019-04-19 00:00:00  0.67068  0.67123  0.67064  0.67123
2019-04-19 00:01:00  0.67068  0.67123  0.67064  0.67123
2019-04-19 00:02:00  0.67062  0.67425  0.67060  0.67223

From the above, we can see that the missing gap (00:01:00) is filled with the help of ffill(). However, the data in that row (row starting with 00:01:00) is not displayed properly as the opening price should be the same as the closing price of the previous row (row starting with 00:00:00). Likewise, the closing price of that row (row starting with 00:01:00) should be the same as the opening price of the next row (row starting with 00:02:00). The desired output should look like this:

                     Open     High     Low      Close
2019-04-19 00:00:00  0.67068  0.67123  0.67064  0.67123
2019-04-19 00:01:00  0.67123  0.67123  0.67064  0.67062
2019-04-19 00:02:00  0.67062  0.67425  0.67060  0.67223

How would I resolve this problem in pandas?

Upvotes: 2

Views: 1637

Answers (2)

Bik
Bik

Reputation: 41

The answer by @mozway is the right direction, but it has two issues:

  1. It will not work if there are more than one consecutive rows of missing values,
  2. The filling of "High" and "Low" are not logical in the OHLC context.

The below is what I would do (after resampling):

  1. Cross-fill the first missing row

    result = df.copy()
    result['open'] = df['open'].fillna(df['close'].ffill(),limit=1)
    result['close'] = df['close'].fillna(df['open'].bfill(),limit=1)
    
  2. Now, propagate equal values of open and close through the rest of the rows:

    result['open'] = result['open'].fillna(result['close'].ffill())
    result['close'] = result['close'].fillna(result['close'].ffill())
    
  3. Finally, set high and low logically as min and max of open and close values:

    result['high'] = result['high'].fillna(result[['open','close']].max(axis=1))
    result['low'] = result['low'].fillna((result[['open','close']].min(axis=1)))
    

This method will work in situations with multiple consecutive missing rows and also treats missing values for 'high' and 'low' logically.

Upvotes: 4

mozway
mozway

Reputation: 260965

Unfortunately, you can't directly specify a fill method per column.

A workaround would be not to fill the values during the resampling but to do it afterwards:

df = df.resample('T').fillna(None)

df['Open'], df['Close'] = (df['Open'].fillna(df['Close'].ffill()),
                           df['Close'].fillna(df['Open'].bfill()))

df = df.ffill()

output:

                        Open     High      Low    Close
2019-04-19 00:00:00  0.67068  0.67123  0.67064  0.67123
2019-04-19 00:01:00  0.67123  0.67123  0.67064  0.67062
2019-04-19 00:02:00  0.67062  0.67425  0.67060  0.67223

previous answer (incorrect for OHLC but interesting as generalization)

(df.resample('T')
   .fillna(None)
   .assign(Close=lambda d: d['Close'].bfill())  # bfill for Close
   .ffill()                                     # ffill for others
)

output:

                        Open     High      Low    Close
2019-04-19 00:00:00  0.67068  0.67123  0.67064  0.67123
2019-04-19 00:01:00  0.67068  0.67123  0.67064  0.67223
2019-04-19 00:02:00  0.67062  0.67425  0.67060  0.67223
cross filling the values:
(df.resample('T')
   .fillna(None)
   .assign(Open=lambda d: d['Open'].fillna(d['Close'].ffill())) # Open = last Close
   .ffill()  # ffill the others
)

output:

                        Open     High      Low    Close
2019-04-19 00:00:00  0.67068  0.67123  0.67064  0.67123
2019-04-19 00:01:00  0.67123  0.67123  0.67064  0.67123
2019-04-19 00:02:00  0.67062  0.67425  0.67060  0.67223
more options

Here is another example where we'll interpolate High and leave Low as NaNs:

(df.resample('T')
   .fillna(None)
   .assign(Open=lambda d: d['Open'].ffill(),
           Close=lambda d: d['Close'].bfill(),
           High=lambda d: d['High'].interpolate()
          )
)

output:

                        Open     High      Low    Close
2019-04-19 00:00:00  0.67068  0.67123  0.67064  0.67123
2019-04-19 00:01:00  0.67068  0.67274      NaN  0.67223
2019-04-19 00:02:00  0.67062  0.67425  0.67060  0.67223

Upvotes: 2

Related Questions