Akash Chandra
Akash Chandra

Reputation: 51

How to replace outlier data in pandas?

I have a stock data grabbed from Yahoo finance, adjusted close data is wrong somehow.

            adj_close    close     ratio
date                                    
2014-10-16   240.4076  2466.40  0.097473
2014-10-17   245.8173  2521.90  0.097473
2014-10-20   250.4522  2569.45  0.097473
2014-10-21   251.8850  2584.15  0.097473
2014-10-22   251.0175  2575.25  0.097473
2014-10-23   251.3392  2578.55  0.097473
2014-10-27   253.2155  2597.80  0.097473
2014-10-28   258.9616  2656.75  0.097473
2014-10-29   257.6944  2643.75  0.097473
2014-10-30   257.1339  2638.00  0.097473
2014-10-31    26.3450  2702.80  0.009747
2014-11-03    26.5463  2723.45  0.009747
2014-11-05    27.1160  2781.90  0.009747
2014-11-07    26.7320  2742.50  0.009747
2014-11-10    26.7027  2739.50  0.009747

Here's a plot of the adjusted close data:

plot of ad_close data

How can I replace data like this using any methods like interpolation or something?

Upvotes: 5

Views: 4755

Answers (2)

MaxU - stand with Ukraine
MaxU - stand with Ukraine

Reputation: 210852

Try this:

In [71]: import pandas_datareader.data as web

In [110]: df = web.DataReader('SBIN.NS', 'yahoo', '2014-10-21', '2014-11-25')

In [111]: df
Out[111]:
                 Open       High        Low      Close    Volume  Adj Close
Date
2014-10-21  2580.0000  2607.0001  2569.5999  2584.1501  15022300   251.8850
2014-10-22  2608.9999  2613.5999  2565.1001  2575.2499  14511100   251.0175
2014-10-23  2591.4001  2593.7000  2573.9999  2578.5501   2376200   251.3392
2014-10-24  2578.5501  2578.5501  2578.5501  2578.5501         0   251.3392
2014-10-27  2592.0001  2619.8999  2581.0001  2597.8000  13429500   253.2155
2014-10-28  2607.9999  2664.2999  2606.0001  2656.7499  22963400   258.9616
2014-10-29  2677.0001  2678.9999  2631.0001  2643.7500  17372900   257.6944
2014-10-30  2649.8999  2653.0499  2622.0001  2637.9999  15544200   257.1339
2014-10-31   265.2000   270.9800   264.6000   270.2800  20770200    26.3450   # <bad_data>
2014-11-03   270.6000   274.3500   269.4250   272.3450  17780600    26.5463
2014-11-04   272.3450   272.3450   272.3450   272.3450         0    26.5463
2014-11-05   273.3000   279.9800   272.4050   278.1900  26605100    27.1160
2014-11-06   278.1900   278.1900   278.1900   278.1900         0    27.1160
2014-11-07   277.5000   278.1000   273.0000   274.2500  18163000    26.7320
2014-11-10   275.9000   276.9000   273.3000   273.9500  12068800    26.7027
2014-11-11   274.7900   276.2500   270.5000   274.0350  17405900    26.7110
2014-11-12   275.3000   277.1500   273.5550   274.6050  16233200    26.7666
2014-11-13   275.6100   276.2250   269.5000   271.9300  16859000    26.5059
2014-11-14   273.0000   280.6900   272.0000   278.7850  50846600    27.1740
2014-11-17   279.4000   295.1300   279.2200   294.0600  49164100    28.6629
2014-11-18   295.6950   297.9000   292.4100   294.5750  32898300    28.7131
2014-11-19   294.9000   296.8000   290.3550   291.0500  20735900    28.3695   # </bad_data>
2014-11-20   294.7500   298.7500   291.2500   297.1000  18099500   289.5925
2014-11-21   299.9000   307.0000   297.2500   305.5000  21009200   297.7802
2014-11-24   307.8000   309.8500   306.0500   308.8500  18631400   301.0456
2014-11-25   309.9000   309.9500   301.0000   304.4500  26776600   296.7568

NOTE: Adj Close column has recovered starting from 2014-11-20, other columns - not, so i'll concentrate on Adj Close only:

let's find outliers (i'm checking for those that have changed for 50+% from the previous day - you may want to change this threshold):

In [112]: bad_idx = df.index[df['Adj Close'].pct_change().abs().ge(0.5)]

In [113]: bad_idx
Out[113]: DatetimeIndex(['2014-10-31', '2014-11-20'], dtype='datetime64[ns]', name='Date', freq=None)

In [114]: df.loc[(df.index >= bad_idx.min()) & (df.index < bad_idx.max()), 'Adj Close'] *= 10

In [115]: df
Out[115]:
                 Open       High        Low      Close    Volume  Adj Close
Date
2014-10-21  2580.0000  2607.0001  2569.5999  2584.1501  15022300   251.8850
2014-10-22  2608.9999  2613.5999  2565.1001  2575.2499  14511100   251.0175
2014-10-23  2591.4001  2593.7000  2573.9999  2578.5501   2376200   251.3392
2014-10-24  2578.5501  2578.5501  2578.5501  2578.5501         0   251.3392
2014-10-27  2592.0001  2619.8999  2581.0001  2597.8000  13429500   253.2155
2014-10-28  2607.9999  2664.2999  2606.0001  2656.7499  22963400   258.9616
2014-10-29  2677.0001  2678.9999  2631.0001  2643.7500  17372900   257.6944
2014-10-30  2649.8999  2653.0499  2622.0001  2637.9999  15544200   257.1339
2014-10-31   265.2000   270.9800   264.6000   270.2800  20770200   263.4500
2014-11-03   270.6000   274.3500   269.4250   272.3450  17780600   265.4630
2014-11-04   272.3450   272.3450   272.3450   272.3450         0   265.4630
2014-11-05   273.3000   279.9800   272.4050   278.1900  26605100   271.1600
2014-11-06   278.1900   278.1900   278.1900   278.1900         0   271.1600
2014-11-07   277.5000   278.1000   273.0000   274.2500  18163000   267.3200
2014-11-10   275.9000   276.9000   273.3000   273.9500  12068800   267.0270
2014-11-11   274.7900   276.2500   270.5000   274.0350  17405900   267.1100
2014-11-12   275.3000   277.1500   273.5550   274.6050  16233200   267.6660
2014-11-13   275.6100   276.2250   269.5000   271.9300  16859000   265.0590
2014-11-14   273.0000   280.6900   272.0000   278.7850  50846600   271.7400
2014-11-17   279.4000   295.1300   279.2200   294.0600  49164100   286.6290
2014-11-18   295.6950   297.9000   292.4100   294.5750  32898300   287.1310
2014-11-19   294.9000   296.8000   290.3550   291.0500  20735900   283.6950
2014-11-20   294.7500   298.7500   291.2500   297.1000  18099500   289.5925
2014-11-21   299.9000   307.0000   297.2500   305.5000  21009200   297.7802
2014-11-24   307.8000   309.8500   306.0500   308.8500  18631400   301.0456
2014-11-25   309.9000   309.9500   301.0000   304.4500  26776600   296.7568

Here is another solution, which uses interpolation:

In [119]: df.loc[(df.index >= bad_idx.min()) & (df.index < bad_idx.max()), 'Adj Close'] = np.nan

In [120]: df
Out[120]:
                 Open       High        Low      Close    Volume  Adj Close
Date
2014-10-21  2580.0000  2607.0001  2569.5999  2584.1501  15022300   251.8850
2014-10-22  2608.9999  2613.5999  2565.1001  2575.2499  14511100   251.0175
2014-10-23  2591.4001  2593.7000  2573.9999  2578.5501   2376200   251.3392
2014-10-24  2578.5501  2578.5501  2578.5501  2578.5501         0   251.3392
2014-10-27  2592.0001  2619.8999  2581.0001  2597.8000  13429500   253.2155
2014-10-28  2607.9999  2664.2999  2606.0001  2656.7499  22963400   258.9616
2014-10-29  2677.0001  2678.9999  2631.0001  2643.7500  17372900   257.6944
2014-10-30  2649.8999  2653.0499  2622.0001  2637.9999  15544200   257.1339
2014-10-31   265.2000   270.9800   264.6000   270.2800  20770200        NaN
2014-11-03   270.6000   274.3500   269.4250   272.3450  17780600        NaN
2014-11-04   272.3450   272.3450   272.3450   272.3450         0        NaN
2014-11-05   273.3000   279.9800   272.4050   278.1900  26605100        NaN
2014-11-06   278.1900   278.1900   278.1900   278.1900         0        NaN
2014-11-07   277.5000   278.1000   273.0000   274.2500  18163000        NaN
2014-11-10   275.9000   276.9000   273.3000   273.9500  12068800        NaN
2014-11-11   274.7900   276.2500   270.5000   274.0350  17405900        NaN
2014-11-12   275.3000   277.1500   273.5550   274.6050  16233200        NaN
2014-11-13   275.6100   276.2250   269.5000   271.9300  16859000        NaN
2014-11-14   273.0000   280.6900   272.0000   278.7850  50846600        NaN
2014-11-17   279.4000   295.1300   279.2200   294.0600  49164100        NaN
2014-11-18   295.6950   297.9000   292.4100   294.5750  32898300        NaN
2014-11-19   294.9000   296.8000   290.3550   291.0500  20735900        NaN
2014-11-20   294.7500   298.7500   291.2500   297.1000  18099500   289.5925
2014-11-21   299.9000   307.0000   297.2500   305.5000  21009200   297.7802
2014-11-24   307.8000   309.8500   306.0500   308.8500  18631400   301.0456
2014-11-25   309.9000   309.9500   301.0000   304.4500  26776600   296.7568

In [122]: df['Adj Close'] = df['Adj Close'].interpolate()

In [123]: df
Out[123]:
                 Open       High        Low      Close    Volume   Adj Close
Date
2014-10-21  2580.0000  2607.0001  2569.5999  2584.1501  15022300  251.885000
2014-10-22  2608.9999  2613.5999  2565.1001  2575.2499  14511100  251.017500
2014-10-23  2591.4001  2593.7000  2573.9999  2578.5501   2376200  251.339200
2014-10-24  2578.5501  2578.5501  2578.5501  2578.5501         0  251.339200
2014-10-27  2592.0001  2619.8999  2581.0001  2597.8000  13429500  253.215500
2014-10-28  2607.9999  2664.2999  2606.0001  2656.7499  22963400  258.961600
2014-10-29  2677.0001  2678.9999  2631.0001  2643.7500  17372900  257.694400
2014-10-30  2649.8999  2653.0499  2622.0001  2637.9999  15544200  257.133900
2014-10-31   265.2000   270.9800   264.6000   270.2800  20770200  259.297807
2014-11-03   270.6000   274.3500   269.4250   272.3450  17780600  261.461713
2014-11-04   272.3450   272.3450   272.3450   272.3450         0  263.625620
2014-11-05   273.3000   279.9800   272.4050   278.1900  26605100  265.789527
2014-11-06   278.1900   278.1900   278.1900   278.1900         0  267.953433
2014-11-07   277.5000   278.1000   273.0000   274.2500  18163000  270.117340
2014-11-10   275.9000   276.9000   273.3000   273.9500  12068800  272.281247
2014-11-11   274.7900   276.2500   270.5000   274.0350  17405900  274.445153
2014-11-12   275.3000   277.1500   273.5550   274.6050  16233200  276.609060
2014-11-13   275.6100   276.2250   269.5000   271.9300  16859000  278.772967
2014-11-14   273.0000   280.6900   272.0000   278.7850  50846600  280.936873
2014-11-17   279.4000   295.1300   279.2200   294.0600  49164100  283.100780
2014-11-18   295.6950   297.9000   292.4100   294.5750  32898300  285.264687
2014-11-19   294.9000   296.8000   290.3550   291.0500  20735900  287.428593
2014-11-20   294.7500   298.7500   291.2500   297.1000  18099500  289.592500
2014-11-21   299.9000   307.0000   297.2500   305.5000  21009200  297.780200
2014-11-24   307.8000   309.8500   306.0500   308.8500  18631400  301.045600
2014-11-25   309.9000   309.9500   301.0000   304.4500  26776600  296.756800

Upvotes: 6

John Zwinck
John Zwinck

Reputation: 249213

The problem is not exactly the data...the problem is you haven't understood the market fundamentals at play here. So, rather than approaching this as a math problem ("replace outliers"), we should look at it as a data sourcing/cleaning problem (fix the data).

The ticker you're looking at is SBIN.NS (State Bank of India). It had a 1:10 split on 2014-11-21, as reported here: http://articles.economictimes.indiatimes.com/2014-11-20/news/56304010_1_india-gains-state-bank-stock-split

The graph you've shown clearly indicates that something went wrong in the Yahoo data around that date.

So what happened?

The first discontinuity is on 2014-10-31, when Yahoo shows the price falling to 1:10. This is pretty clearly a bug. I would guess that their automated corporate actions parser got the notification of the pending 1:10 split on that date, and applied it immediately rather than the record date of 2014-11-21. So from 2014-10-31 to 2014-11-20 inclusive, you have the wrong price by a factor of 10.

The best solution in this case, unless you can get data from another source, is to simply note down this error and multiply the price from Yahoo by 10 for the erroneous dates.

Upvotes: 1

Related Questions