Reputation: 51
I have a stock data grabbed from Yahoo finance, adjusted close data is wrong somehow.
adj_close close ratio
date
2014-10-16 240.4076 2466.40 0.097473
2014-10-17 245.8173 2521.90 0.097473
2014-10-20 250.4522 2569.45 0.097473
2014-10-21 251.8850 2584.15 0.097473
2014-10-22 251.0175 2575.25 0.097473
2014-10-23 251.3392 2578.55 0.097473
2014-10-27 253.2155 2597.80 0.097473
2014-10-28 258.9616 2656.75 0.097473
2014-10-29 257.6944 2643.75 0.097473
2014-10-30 257.1339 2638.00 0.097473
2014-10-31 26.3450 2702.80 0.009747
2014-11-03 26.5463 2723.45 0.009747
2014-11-05 27.1160 2781.90 0.009747
2014-11-07 26.7320 2742.50 0.009747
2014-11-10 26.7027 2739.50 0.009747
Here's a plot of the adjusted close data:
How can I replace data like this using any methods like interpolation or something?
Upvotes: 5
Views: 4755
Reputation: 210852
Try this:
In [71]: import pandas_datareader.data as web
In [110]: df = web.DataReader('SBIN.NS', 'yahoo', '2014-10-21', '2014-11-25')
In [111]: df
Out[111]:
Open High Low Close Volume Adj Close
Date
2014-10-21 2580.0000 2607.0001 2569.5999 2584.1501 15022300 251.8850
2014-10-22 2608.9999 2613.5999 2565.1001 2575.2499 14511100 251.0175
2014-10-23 2591.4001 2593.7000 2573.9999 2578.5501 2376200 251.3392
2014-10-24 2578.5501 2578.5501 2578.5501 2578.5501 0 251.3392
2014-10-27 2592.0001 2619.8999 2581.0001 2597.8000 13429500 253.2155
2014-10-28 2607.9999 2664.2999 2606.0001 2656.7499 22963400 258.9616
2014-10-29 2677.0001 2678.9999 2631.0001 2643.7500 17372900 257.6944
2014-10-30 2649.8999 2653.0499 2622.0001 2637.9999 15544200 257.1339
2014-10-31 265.2000 270.9800 264.6000 270.2800 20770200 26.3450 # <bad_data>
2014-11-03 270.6000 274.3500 269.4250 272.3450 17780600 26.5463
2014-11-04 272.3450 272.3450 272.3450 272.3450 0 26.5463
2014-11-05 273.3000 279.9800 272.4050 278.1900 26605100 27.1160
2014-11-06 278.1900 278.1900 278.1900 278.1900 0 27.1160
2014-11-07 277.5000 278.1000 273.0000 274.2500 18163000 26.7320
2014-11-10 275.9000 276.9000 273.3000 273.9500 12068800 26.7027
2014-11-11 274.7900 276.2500 270.5000 274.0350 17405900 26.7110
2014-11-12 275.3000 277.1500 273.5550 274.6050 16233200 26.7666
2014-11-13 275.6100 276.2250 269.5000 271.9300 16859000 26.5059
2014-11-14 273.0000 280.6900 272.0000 278.7850 50846600 27.1740
2014-11-17 279.4000 295.1300 279.2200 294.0600 49164100 28.6629
2014-11-18 295.6950 297.9000 292.4100 294.5750 32898300 28.7131
2014-11-19 294.9000 296.8000 290.3550 291.0500 20735900 28.3695 # </bad_data>
2014-11-20 294.7500 298.7500 291.2500 297.1000 18099500 289.5925
2014-11-21 299.9000 307.0000 297.2500 305.5000 21009200 297.7802
2014-11-24 307.8000 309.8500 306.0500 308.8500 18631400 301.0456
2014-11-25 309.9000 309.9500 301.0000 304.4500 26776600 296.7568
NOTE: Adj Close
column has recovered starting from 2014-11-20
, other columns - not, so i'll concentrate on Adj Close
only:
let's find outliers (i'm checking for those that have changed for 50+%
from the previous day - you may want to change this threshold):
In [112]: bad_idx = df.index[df['Adj Close'].pct_change().abs().ge(0.5)]
In [113]: bad_idx
Out[113]: DatetimeIndex(['2014-10-31', '2014-11-20'], dtype='datetime64[ns]', name='Date', freq=None)
In [114]: df.loc[(df.index >= bad_idx.min()) & (df.index < bad_idx.max()), 'Adj Close'] *= 10
In [115]: df
Out[115]:
Open High Low Close Volume Adj Close
Date
2014-10-21 2580.0000 2607.0001 2569.5999 2584.1501 15022300 251.8850
2014-10-22 2608.9999 2613.5999 2565.1001 2575.2499 14511100 251.0175
2014-10-23 2591.4001 2593.7000 2573.9999 2578.5501 2376200 251.3392
2014-10-24 2578.5501 2578.5501 2578.5501 2578.5501 0 251.3392
2014-10-27 2592.0001 2619.8999 2581.0001 2597.8000 13429500 253.2155
2014-10-28 2607.9999 2664.2999 2606.0001 2656.7499 22963400 258.9616
2014-10-29 2677.0001 2678.9999 2631.0001 2643.7500 17372900 257.6944
2014-10-30 2649.8999 2653.0499 2622.0001 2637.9999 15544200 257.1339
2014-10-31 265.2000 270.9800 264.6000 270.2800 20770200 263.4500
2014-11-03 270.6000 274.3500 269.4250 272.3450 17780600 265.4630
2014-11-04 272.3450 272.3450 272.3450 272.3450 0 265.4630
2014-11-05 273.3000 279.9800 272.4050 278.1900 26605100 271.1600
2014-11-06 278.1900 278.1900 278.1900 278.1900 0 271.1600
2014-11-07 277.5000 278.1000 273.0000 274.2500 18163000 267.3200
2014-11-10 275.9000 276.9000 273.3000 273.9500 12068800 267.0270
2014-11-11 274.7900 276.2500 270.5000 274.0350 17405900 267.1100
2014-11-12 275.3000 277.1500 273.5550 274.6050 16233200 267.6660
2014-11-13 275.6100 276.2250 269.5000 271.9300 16859000 265.0590
2014-11-14 273.0000 280.6900 272.0000 278.7850 50846600 271.7400
2014-11-17 279.4000 295.1300 279.2200 294.0600 49164100 286.6290
2014-11-18 295.6950 297.9000 292.4100 294.5750 32898300 287.1310
2014-11-19 294.9000 296.8000 290.3550 291.0500 20735900 283.6950
2014-11-20 294.7500 298.7500 291.2500 297.1000 18099500 289.5925
2014-11-21 299.9000 307.0000 297.2500 305.5000 21009200 297.7802
2014-11-24 307.8000 309.8500 306.0500 308.8500 18631400 301.0456
2014-11-25 309.9000 309.9500 301.0000 304.4500 26776600 296.7568
Here is another solution, which uses interpolation:
In [119]: df.loc[(df.index >= bad_idx.min()) & (df.index < bad_idx.max()), 'Adj Close'] = np.nan
In [120]: df
Out[120]:
Open High Low Close Volume Adj Close
Date
2014-10-21 2580.0000 2607.0001 2569.5999 2584.1501 15022300 251.8850
2014-10-22 2608.9999 2613.5999 2565.1001 2575.2499 14511100 251.0175
2014-10-23 2591.4001 2593.7000 2573.9999 2578.5501 2376200 251.3392
2014-10-24 2578.5501 2578.5501 2578.5501 2578.5501 0 251.3392
2014-10-27 2592.0001 2619.8999 2581.0001 2597.8000 13429500 253.2155
2014-10-28 2607.9999 2664.2999 2606.0001 2656.7499 22963400 258.9616
2014-10-29 2677.0001 2678.9999 2631.0001 2643.7500 17372900 257.6944
2014-10-30 2649.8999 2653.0499 2622.0001 2637.9999 15544200 257.1339
2014-10-31 265.2000 270.9800 264.6000 270.2800 20770200 NaN
2014-11-03 270.6000 274.3500 269.4250 272.3450 17780600 NaN
2014-11-04 272.3450 272.3450 272.3450 272.3450 0 NaN
2014-11-05 273.3000 279.9800 272.4050 278.1900 26605100 NaN
2014-11-06 278.1900 278.1900 278.1900 278.1900 0 NaN
2014-11-07 277.5000 278.1000 273.0000 274.2500 18163000 NaN
2014-11-10 275.9000 276.9000 273.3000 273.9500 12068800 NaN
2014-11-11 274.7900 276.2500 270.5000 274.0350 17405900 NaN
2014-11-12 275.3000 277.1500 273.5550 274.6050 16233200 NaN
2014-11-13 275.6100 276.2250 269.5000 271.9300 16859000 NaN
2014-11-14 273.0000 280.6900 272.0000 278.7850 50846600 NaN
2014-11-17 279.4000 295.1300 279.2200 294.0600 49164100 NaN
2014-11-18 295.6950 297.9000 292.4100 294.5750 32898300 NaN
2014-11-19 294.9000 296.8000 290.3550 291.0500 20735900 NaN
2014-11-20 294.7500 298.7500 291.2500 297.1000 18099500 289.5925
2014-11-21 299.9000 307.0000 297.2500 305.5000 21009200 297.7802
2014-11-24 307.8000 309.8500 306.0500 308.8500 18631400 301.0456
2014-11-25 309.9000 309.9500 301.0000 304.4500 26776600 296.7568
In [122]: df['Adj Close'] = df['Adj Close'].interpolate()
In [123]: df
Out[123]:
Open High Low Close Volume Adj Close
Date
2014-10-21 2580.0000 2607.0001 2569.5999 2584.1501 15022300 251.885000
2014-10-22 2608.9999 2613.5999 2565.1001 2575.2499 14511100 251.017500
2014-10-23 2591.4001 2593.7000 2573.9999 2578.5501 2376200 251.339200
2014-10-24 2578.5501 2578.5501 2578.5501 2578.5501 0 251.339200
2014-10-27 2592.0001 2619.8999 2581.0001 2597.8000 13429500 253.215500
2014-10-28 2607.9999 2664.2999 2606.0001 2656.7499 22963400 258.961600
2014-10-29 2677.0001 2678.9999 2631.0001 2643.7500 17372900 257.694400
2014-10-30 2649.8999 2653.0499 2622.0001 2637.9999 15544200 257.133900
2014-10-31 265.2000 270.9800 264.6000 270.2800 20770200 259.297807
2014-11-03 270.6000 274.3500 269.4250 272.3450 17780600 261.461713
2014-11-04 272.3450 272.3450 272.3450 272.3450 0 263.625620
2014-11-05 273.3000 279.9800 272.4050 278.1900 26605100 265.789527
2014-11-06 278.1900 278.1900 278.1900 278.1900 0 267.953433
2014-11-07 277.5000 278.1000 273.0000 274.2500 18163000 270.117340
2014-11-10 275.9000 276.9000 273.3000 273.9500 12068800 272.281247
2014-11-11 274.7900 276.2500 270.5000 274.0350 17405900 274.445153
2014-11-12 275.3000 277.1500 273.5550 274.6050 16233200 276.609060
2014-11-13 275.6100 276.2250 269.5000 271.9300 16859000 278.772967
2014-11-14 273.0000 280.6900 272.0000 278.7850 50846600 280.936873
2014-11-17 279.4000 295.1300 279.2200 294.0600 49164100 283.100780
2014-11-18 295.6950 297.9000 292.4100 294.5750 32898300 285.264687
2014-11-19 294.9000 296.8000 290.3550 291.0500 20735900 287.428593
2014-11-20 294.7500 298.7500 291.2500 297.1000 18099500 289.592500
2014-11-21 299.9000 307.0000 297.2500 305.5000 21009200 297.780200
2014-11-24 307.8000 309.8500 306.0500 308.8500 18631400 301.045600
2014-11-25 309.9000 309.9500 301.0000 304.4500 26776600 296.756800
Upvotes: 6
Reputation: 249213
The problem is not exactly the data...the problem is you haven't understood the market fundamentals at play here. So, rather than approaching this as a math problem ("replace outliers"), we should look at it as a data sourcing/cleaning problem (fix the data).
The ticker you're looking at is SBIN.NS (State Bank of India). It had a 1:10 split on 2014-11-21, as reported here: http://articles.economictimes.indiatimes.com/2014-11-20/news/56304010_1_india-gains-state-bank-stock-split
The graph you've shown clearly indicates that something went wrong in the Yahoo data around that date.
So what happened?
The first discontinuity is on 2014-10-31, when Yahoo shows the price falling to 1:10. This is pretty clearly a bug. I would guess that their automated corporate actions parser got the notification of the pending 1:10 split on that date, and applied it immediately rather than the record date of 2014-11-21. So from 2014-10-31 to 2014-11-20 inclusive, you have the wrong price by a factor of 10.
The best solution in this case, unless you can get data from another source, is to simply note down this error and multiply the price from Yahoo by 10 for the erroneous dates.
Upvotes: 1