Reputation: 391
I have got OHLC data with missing time frames. Suppose I have the following pandas dataframe denoted by the variable df:
Open High Low Close
2019-04-19 00:00:00 0.67068 0.67123 0.67064 0.67123
2019-04-19 00:02:00 0.67062 0.67425 0.67060 0.67223
Now, I resample that pandas dataframe to fill the missing gap and I get the following:
df = df.resample('T').ffill()
Open High Low Close
2019-04-19 00:00:00 0.67068 0.67123 0.67064 0.67123
2019-04-19 00:01:00 0.67068 0.67123 0.67064 0.67123
2019-04-19 00:02:00 0.67062 0.67425 0.67060 0.67223
From the above, we can see that the missing gap (00:01:00) is filled with the help of ffill(). However, the data in that row (row starting with 00:01:00) is not displayed properly as the opening price should be the same as the closing price of the previous row (row starting with 00:00:00). Likewise, the closing price of that row (row starting with 00:01:00) should be the same as the opening price of the next row (row starting with 00:02:00). The desired output should look like this:
Open High Low Close
2019-04-19 00:00:00 0.67068 0.67123 0.67064 0.67123
2019-04-19 00:01:00 0.67123 0.67123 0.67064 0.67062
2019-04-19 00:02:00 0.67062 0.67425 0.67060 0.67223
How would I resolve this problem in pandas?
Upvotes: 2
Views: 1637
Reputation: 41
The answer by @mozway is the right direction, but it has two issues:
The below is what I would do (after resampling):
Cross-fill the first missing row
result = df.copy()
result['open'] = df['open'].fillna(df['close'].ffill(),limit=1)
result['close'] = df['close'].fillna(df['open'].bfill(),limit=1)
Now, propagate equal values of open and close through the rest of the rows:
result['open'] = result['open'].fillna(result['close'].ffill())
result['close'] = result['close'].fillna(result['close'].ffill())
Finally, set high and low logically as min and max of open and close values:
result['high'] = result['high'].fillna(result[['open','close']].max(axis=1))
result['low'] = result['low'].fillna((result[['open','close']].min(axis=1)))
This method will work in situations with multiple consecutive missing rows and also treats missing values for 'high' and 'low' logically.
Upvotes: 4
Reputation: 260965
Unfortunately, you can't directly specify a fill method per column.
A workaround would be not to fill the values during the resampling but to do it afterwards:
df = df.resample('T').fillna(None)
df['Open'], df['Close'] = (df['Open'].fillna(df['Close'].ffill()),
df['Close'].fillna(df['Open'].bfill()))
df = df.ffill()
output:
Open High Low Close
2019-04-19 00:00:00 0.67068 0.67123 0.67064 0.67123
2019-04-19 00:01:00 0.67123 0.67123 0.67064 0.67062
2019-04-19 00:02:00 0.67062 0.67425 0.67060 0.67223
(df.resample('T')
.fillna(None)
.assign(Close=lambda d: d['Close'].bfill()) # bfill for Close
.ffill() # ffill for others
)
output:
Open High Low Close
2019-04-19 00:00:00 0.67068 0.67123 0.67064 0.67123
2019-04-19 00:01:00 0.67068 0.67123 0.67064 0.67223
2019-04-19 00:02:00 0.67062 0.67425 0.67060 0.67223
(df.resample('T')
.fillna(None)
.assign(Open=lambda d: d['Open'].fillna(d['Close'].ffill())) # Open = last Close
.ffill() # ffill the others
)
output:
Open High Low Close
2019-04-19 00:00:00 0.67068 0.67123 0.67064 0.67123
2019-04-19 00:01:00 0.67123 0.67123 0.67064 0.67123
2019-04-19 00:02:00 0.67062 0.67425 0.67060 0.67223
Here is another example where we'll interpolate High and leave Low as NaNs:
(df.resample('T')
.fillna(None)
.assign(Open=lambda d: d['Open'].ffill(),
Close=lambda d: d['Close'].bfill(),
High=lambda d: d['High'].interpolate()
)
)
output:
Open High Low Close
2019-04-19 00:00:00 0.67068 0.67123 0.67064 0.67123
2019-04-19 00:01:00 0.67068 0.67274 NaN 0.67223
2019-04-19 00:02:00 0.67062 0.67425 0.67060 0.67223
Upvotes: 2