Reputation:
I would like to write a python script that will check if there is any missing day. If there is it should take the price from the latest day and create a new day in data. I mean something like shown below. My data is in CSV files. Any ideas how it can be done?
Before:
MSFT,5-Jun-07,259.16
MSFT,3-Jun-07,253.28
MSFT,1-Jun-07,249.95
MSFT,31-May-07,248.71
MSFT,29-May-07,243.31
After:
MSFT,5-Jun-07,259.16
MSFT,4-Jun-07,253.28
MSFT,3-Jun-07,253.28
MSFT,2-Jun-07,249.95
MSFT,1-Jun-07,249.95
MSFT,31-May-07,248.71
MSFT,30-May-07,243.31
MSFT,29-May-07,243.31
My solution:
import pandas as pd
df = pd.read_csv("path/to/file/file.csv",names=list("abc")) # read string as file
cols = df.columns # store column order
df.b = pd.to_datetime(df.b) # convert col Date to datetime
df.set_index("b",inplace=True) # set col Date as index
df = df.resample("D").ffill().reset_index() # resample Days and fill values
df = df[cols] # revert order
df.sort_values(by="b",ascending=False,inplace=True) # sort by date
df["b"] = df["b"].dt.strftime("%-d-%b-%y") # revert date format
df.to_csv("data.csv",index=False,header=False) #specify outputfile if needed
print(df.to_string())
Upvotes: 0
Views: 779
Reputation: 18916
Using pandas library this operation can be made on a single line. But first we need to read in your data to the right formats:
import io
import pandas as pd
s = u"""name,Date,Close
MSFT,30-Dec-16,771.82
MSFT,29-Dec-16,782.79
MSFT,28-Dec-16,785.05
MSFT,27-Dec-16,791.55
MSFT,23-Dec-16,789.91
MSFT,16-Dec-16,790.8
MSFT,15-Dec-16,797.85
MSFT,14-Dec-16,797.07"""
#df = pd.read_csv("path/to/file.csv") # read from file
df = pd.read_csv(io.StringIO(s)) # read string as file
cols = df.columns # store column order
df.Date = pd.to_datetime(df.Date) # convert col Date to datetime
df.set_index("Date",inplace=True) # set col Date as index
df = df.resample("D").ffill().reset_index() # resample Days and fill values
df
Returns:
Date name Close
0 2016-12-14 MSFT 797.07
1 2016-12-15 MSFT 797.85
2 2016-12-16 MSFT 790.80
3 2016-12-17 MSFT 790.80
4 2016-12-18 MSFT 790.80
5 2016-12-19 MSFT 790.80
6 2016-12-20 MSFT 790.80
7 2016-12-21 MSFT 790.80
8 2016-12-22 MSFT 790.80
9 2016-12-23 MSFT 789.91
10 2016-12-24 MSFT 789.91
11 2016-12-25 MSFT 789.91
12 2016-12-26 MSFT 789.91
13 2016-12-27 MSFT 791.55
14 2016-12-28 MSFT 785.05
15 2016-12-29 MSFT 782.79
16 2016-12-30 MSFT 771.82
Return back to csv with:
df = df[cols] # revert order
df.sort_values(by="Date",ascending=False,inplace=True) # sort by date
df["Date"] = df["Date"].dt.strftime("%-d-%b-%y") # revert date format
df.to_csv(index=False,header=False) #specify outputfile if needed
Output:
MSFT,30-Dec-16,771.82
MSFT,29-Dec-16,782.79
MSFT,28-Dec-16,785.05
MSFT,27-Dec-16,791.55
MSFT,26-Dec-16,789.91
MSFT,25-Dec-16,789.91
MSFT,24-Dec-16,789.91
MSFT,23-Dec-16,789.91
...
Upvotes: 1
Reputation: 21643
This code uses standard routines.
from datetime import datetime, timedelta
Input lines will have to be split on commas, and the dates parsed in two places in the main part of the code. I have therefore put this work in a single function.
def glean(s):
msft, date_part, amount = s.split(',')
if date_part.find('-')==1:
date_part = '0'+date_part
date = datetime.strptime(date_part, '%d-%b-%y')
return date, amount
Similarly, dates will have to be formatted for output with other pieces of data in a number of places in the main code.
def out(date,amount):
date_str = date.strftime('%d-%b-%y')
print(('%s,%s,%s' % ('MSFT', date_str, amount)).replace('MSFT,0', 'MSFT,'))
with open('before.txt') as before:
I read the initial line of data on its own to establish the first date for comparison with date in the next line.
previous_date, previous_amount = glean(before.readline().strip())
out(previous_date, previous_amount)
for line in before.readlines():
date, amount = glean(line.strip())
I calculate the elapsed time between the current line and the previous line, to know how many lines to output in place of missing lines.
elapsed = previous_date - date
setting_date
is decremented from previous_date
for the number of days that elapsed without data. One line is omitted for each day, if there were any.
setting_date = previous_date
for i in range(-1+elapsed.days):
setting_date -= timedelta(days=1)
out(setting_date, previous_amount)
Now the available line of data is output.
out(date, amount)
Now previous_date
and previous_amount
are reset to reflect the new values, for use against the next line of data, if any.
previous_date, previous_amount = date, amount
Output:
MSFT,5-Jun-07,259.16
MSFT,4-Jun-07,259.16
MSFT,3-Jun-07,253.28
MSFT,2-Jun-07,253.28
MSFT,1-Jun-07,249.95
MSFT,31-May-07,248.71
MSFT,30-May-07,248.71
MSFT,29-May-07,243.31
Upvotes: 0
Reputation: 152
To do this, you would need to iterate through your dataframe using nested for loops. That would look something like:
for column in df:
for row in df:
do_something()
To give you an idea, the
do_something()
part of your code would probably be something like checking if there was a gap between the dates. Then you would copy the other columns from the row above and insert a new row using:
df.loc[row] = [2, 3, 4] # adding a row
df.index = df.index + 1 # shifting index
df = df.sort() # sorting by index
Hope this helped give you an idea of how you would solve this. Let me know if you want some more code!
Upvotes: 0