Slartibartfast
Slartibartfast

Reputation: 1190

Resampled data not matching in Pandas

Not able to get the same results with resampled data. For example:

import yfinance as yf
import pandas as pd
df = yf.download('f', interval = '1mo')
df = df[~df.index.duplicated(keep='last')]
# df = df.resample('CM').mean()
df['pct'] = df['Close'].pct_change()

print(df)

Produces this DataFrame

             Open    High    Low  Close  Adj Close        Volume       pct
Date                                                                      
1972-10-25    NaN     NaN    NaN    NaN        NaN           NaN       NaN
1973-01-23    NaN     NaN    NaN    NaN        NaN           NaN       NaN
1973-07-24    NaN     NaN    NaN    NaN        NaN           NaN       NaN
1973-10-23    NaN     NaN    NaN    NaN        NaN           NaN       NaN
1975-04-24    NaN     NaN    NaN    NaN        NaN           NaN       NaN
          ...     ...    ...    ...        ...           ...       ...
2022-03-01  17.41  18.250  15.51  16.91  16.798235  1.749488e+09 -0.037016
2022-04-01  17.01  17.150  13.90  14.16  14.066411  1.341340e+09 -0.162626
2022-05-01  14.02  15.000  12.07  13.68  13.680000  1.359644e+09 -0.033898
2022-06-01  13.88  13.970  13.36  13.50  13.500000  1.372803e+08 -0.013158
2022-06-03  13.63  13.775  13.36  13.50  13.500000  4.362525e+07  0.000000

I have resampled a daily data with 'MS' calendar month begin. This is so the resample data matches the default data we get using interval = '1mo'

df = yf.download('f')
df = df[~df.index.duplicated(keep='last')]
df = df.resample('MS').mean()
df['pct'] = df['Close'].pct_change()

print(df) 

And this produces

Date                                         ...                                   
1972-06-01   2.015902   2.127047   2.098477  ...   0.267396  1.545780e+06       NaN
1972-07-01   2.053871   2.071511   2.039882  ...   0.260849  1.061645e+06 -0.027089
1972-08-01   2.155512   2.171732   2.140879  ...   0.276273  1.683096e+06  0.050896
1972-09-01   2.124429   2.136392   2.110845  ...   0.271796  9.511339e+05 -0.016205
1972-10-01   2.118384   2.128706   2.106587  ...   0.271777  1.152049e+06 -0.002440
              ...        ...        ...  ...        ...           ...       ...
2022-02-01  18.114211  18.490000  17.770526  ...  18.019060  9.611196e+07 -0.194397
2022-03-01  16.770435  17.036521  16.379565  ...  16.583577  7.606470e+07 -0.079665
2022-04-01  15.530500  15.796500  15.176500  ...  15.341886  6.706698e+07 -0.076340
2022-05-01  13.387619  13.679048  13.084762  ...  13.387619  6.474497e+07 -0.131773
2022-06-01  13.716667  13.903333  13.453333  ...  13.646667  4.576010e+07  0.019350

[601 rows x 7 columns]

When comparing same date in these two DataFrames one can see data is not matching. I am especially interested with the last column which is pct_change() as they are way off.

2022-05-01  14.02  15.000  12.07  13.68  13.680000  1.359644e+09 -0.03389

By specifying interval = '1mo'

2022-05-01  13.387619  13.679048  13.084762  ...  13.387619  6.474497e+07 -0.131773

By way of resampling data from daily to Monthly.

I was earlier under the impression that it was due to the fact that the resampled data is sampled at the end of the month compared to Yahoo data which starts at the beginning of the month. However even with resampling at start of calendar month, data is not matching. Could you please advise why is the resampled data not matching the monthly data?

Upvotes: 0

Views: 311

Answers (1)

Alfredo Maussa
Alfredo Maussa

Reputation: 545

First data

You are taking different sampling, in the first:

df = yf.download('f', interval = '1mo')

You are taking one month candle, the open price is for the first day 01/07/2022 while the close price is the last day from that month, for instance 07/06/2022, today.

Second data

df = yf.download('f')

Although in the pypi page says the default interval is "1mo", I have tested it and appears to be "1d" (business days).

So, the mean is calculated for all the days in the corresponding month. This line is averaging the close price for all the days of the month. Which usually is undesired.

df2.resample('MS').mean()

Test

df2 = yf.download('f')
df2 = df2[~df2.index.duplicated(keep='last')]
# df2 = df2.resample('MS').first()
# df2['pct'] = df2['Close'].pct_change()
df2.tail(10)

Here I use ...agg.first() to track what is the first value of the aggregation data. enter image description here

Then print more rows for the df2 (with default interval): enter image description here

Notice it match with the lowest low value of the months' days and take the last close price (yesterday).

Solution

Be aware of defining the interval and take in mind than the mean of the month days doesn't relate with the candle prices for a month. You should use min(low) and last(close) instead of mean(low) and mean(close).

df2.resample('MS').agg({'Open':'first','High':'max','Low': 'min','Close':'last'})

Upvotes: 1

Related Questions