Reputation:

How to add automatically data to the missing days in historical stock prices?

I would like to write a python script that will check if there is any missing day. If there is it should take the price from the latest day and create a new day in data. I mean something like shown below. My data is in CSV files. Any ideas how it can be done?

Before:

MSFT,5-Jun-07,259.16
MSFT,3-Jun-07,253.28
MSFT,1-Jun-07,249.95
MSFT,31-May-07,248.71
MSFT,29-May-07,243.31

After:

MSFT,5-Jun-07,259.16
MSFT,4-Jun-07,253.28
MSFT,3-Jun-07,253.28
MSFT,2-Jun-07,249.95
MSFT,1-Jun-07,249.95
MSFT,31-May-07,248.71
MSFT,30-May-07,243.31
MSFT,29-May-07,243.31

My solution:

import pandas as pd
df = pd.read_csv("path/to/file/file.csv",names=list("abc")) # read string as file


cols = df.columns # store column order
df.b = pd.to_datetime(df.b) # convert col Date to datetime
df.set_index("b",inplace=True) # set col Date as index
df = df.resample("D").ffill().reset_index() # resample Days and fill values

df = df[cols] # revert order
df.sort_values(by="b",ascending=False,inplace=True) # sort by date
df["b"] = df["b"].dt.strftime("%-d-%b-%y") # revert date format
df.to_csv("data.csv",index=False,header=False) #specify outputfile if needed

print(df.to_string())

Upvotes: 0

Answers (3)

Anton vBR

Reputation: 18916

Using pandas library this operation can be made on a single line. But first we need to read in your data to the right formats:

import io
import pandas as pd

s = u"""name,Date,Close
MSFT,30-Dec-16,771.82
MSFT,29-Dec-16,782.79
MSFT,28-Dec-16,785.05
MSFT,27-Dec-16,791.55
MSFT,23-Dec-16,789.91
MSFT,16-Dec-16,790.8
MSFT,15-Dec-16,797.85
MSFT,14-Dec-16,797.07"""

#df = pd.read_csv("path/to/file.csv") # read from file
df = pd.read_csv(io.StringIO(s)) # read string as file

cols = df.columns # store column order
df.Date = pd.to_datetime(df.Date) # convert col Date to datetime
df.set_index("Date",inplace=True) # set col Date as index
df = df.resample("D").ffill().reset_index() # resample Days and fill values

df

Returns:

         Date  name   Close
0  2016-12-14  MSFT  797.07
1  2016-12-15  MSFT  797.85
2  2016-12-16  MSFT  790.80
3  2016-12-17  MSFT  790.80
4  2016-12-18  MSFT  790.80
5  2016-12-19  MSFT  790.80
6  2016-12-20  MSFT  790.80
7  2016-12-21  MSFT  790.80
8  2016-12-22  MSFT  790.80
9  2016-12-23  MSFT  789.91
10 2016-12-24  MSFT  789.91
11 2016-12-25  MSFT  789.91
12 2016-12-26  MSFT  789.91
13 2016-12-27  MSFT  791.55
14 2016-12-28  MSFT  785.05
15 2016-12-29  MSFT  782.79
16 2016-12-30  MSFT  771.82

Return back to csv with:

df = df[cols] # revert order
df.sort_values(by="Date",ascending=False,inplace=True) # sort by date
df["Date"] = df["Date"].dt.strftime("%-d-%b-%y") # revert date format
df.to_csv(index=False,header=False) #specify outputfile if needed

Output:

MSFT,30-Dec-16,771.82
MSFT,29-Dec-16,782.79
MSFT,28-Dec-16,785.05
MSFT,27-Dec-16,791.55
MSFT,26-Dec-16,789.91
MSFT,25-Dec-16,789.91
MSFT,24-Dec-16,789.91
MSFT,23-Dec-16,789.91
...

Upvotes: 1

Bill Bell

Reputation: 21643

This code uses standard routines.

from datetime import datetime, timedelta

Input lines will have to be split on commas, and the dates parsed in two places in the main part of the code. I have therefore put this work in a single function.

def glean(s):
    msft, date_part, amount = s.split(',')
    if date_part.find('-')==1: 
        date_part = '0'+date_part
    date = datetime.strptime(date_part, '%d-%b-%y')
    return date, amount

Similarly, dates will have to be formatted for output with other pieces of data in a number of places in the main code.

def out(date,amount):
    date_str = date.strftime('%d-%b-%y')
    print(('%s,%s,%s' % ('MSFT', date_str, amount)).replace('MSFT,0', 'MSFT,'))

with open('before.txt') as before:

I read the initial line of data on its own to establish the first date for comparison with date in the next line.

previous_date, previous_amount = glean(before.readline().strip())
out(previous_date, previous_amount)
for line in before.readlines():
    date, amount = glean(line.strip())

I calculate the elapsed time between the current line and the previous line, to know how many lines to output in place of missing lines.

    elapsed = previous_date - date

setting_date is decremented from previous_date for the number of days that elapsed without data. One line is omitted for each day, if there were any.

    setting_date = previous_date
    for i in range(-1+elapsed.days):
        setting_date -= timedelta(days=1)
        out(setting_date, previous_amount)

Now the available line of data is output.

    out(date, amount)

Now previous_date and previous_amount are reset to reflect the new values, for use against the next line of data, if any.

    previous_date, previous_amount = date, amount

Output:

MSFT,5-Jun-07,259.16
MSFT,4-Jun-07,259.16
MSFT,3-Jun-07,253.28
MSFT,2-Jun-07,253.28
MSFT,1-Jun-07,249.95
MSFT,31-May-07,248.71
MSFT,30-May-07,248.71
MSFT,29-May-07,243.31

Upvotes: 0

jaguar

Reputation: 152

To do this, you would need to iterate through your dataframe using nested for loops. That would look something like:

for column in df:
    for row in df:
        do_something()

To give you an idea, the

do_something()

part of your code would probably be something like checking if there was a gap between the dates. Then you would copy the other columns from the row above and insert a new row using:

df.loc[row] = [2, 3, 4]  # adding a row
df.index = df.index + 1  # shifting index
df = df.sort()  # sorting by index

Hope this helped give you an idea of how you would solve this. Let me know if you want some more code!

Upvotes: 0

How to add automatically data to the missing days in historical stock prices?

Answers (3)

Related Questions