Reputation: 13
I have a CSV file with the following structure:
2015-01-02,09:30:00,64.815
2015-01-02,09:35:00,64.8741
2015-01-02,09:55:00,65.0255
2015-01-02,10:00:00,64.9269
Using pandas in Python, I would like to repeat the 2nd row for every missing 5-minute interval and insert the copies after the 2nd row (that is, fill the missing intervals with the 2nd row's price). The result should look like:
2015-01-02,09:30:00,64.815
2015-01-02,09:35:00,64.8741
2015-01-02,09:40:00,64.8741
2015-01-02,09:45:00,64.8741
2015-01-02,09:50:00,64.8741
2015-01-02,09:55:00,65.0255
2015-01-02,10:00:00,64.9269
2015-01-02,10:05:00,64.815
I have the following code:
df = pd.read_csv("csv.file", header=0, names=['date', 'minute', 'price'])
for i in range(len(df)):
    if i != len(df)-1:
        next_i = i+1
        if df.loc[next_i, 'date'] == df.loc[i, 'date'] and df.loc[i, 'minute'] != "16:00:00":
            now = int(df.loc[i, "minute"][:2]+df.loc[i, "minute"][3:5])
            future = int(df.loc[next_i, "minute"][:2]+df.loc[next_i, "minute"][3:5])
            while now + 5 != future and df.loc[next_i, "minute"][3:5] != "00" and df.loc[next_i, "minute"][3:5] != "60":
                newminutes = str(int(df.loc[i, "minute"][3:5])+5*a)
                newtime = df.loc[next_i, "minute"][:2] +":"+newminutes+":00"
                df.loc[next_i-0.5] = [df.loc[next_i, 'date'], newtime , df.loc[i, 'price']]
                df = df.sort_index().reset_index(drop=True)
                now = int(newtime[:2]+newtime[3:5])
                future = int(df.loc[next_i+1, "minute"][:2]+df.loc[next_i+1, "minute"][3:5])
However, it's not working.
Upvotes: 1
Views: 250
Reputation: 1049
Try the pandas merge_ordered function.
Create the original data frame:
import pandas as pd

data = {
    'date': ['2015-01-02', '2015-01-02', '2015-01-02', '2015-01-02'],
    'time': ['09:30:00', '09:35:00', '09:55:00', '10:00:00'],
    'val': [64.815, 64.8741, 65.0255, 64.9269]
}
df = pd.DataFrame(data)
# Combine the date and time strings into a single datetime column.
df['datetime'] = pd.to_datetime(df['date'] + ' ' + df['time'])
Create a second data frame, df2, with 5-minute intervals running from the minimum to the maximum datetime in df:
df2=pd.DataFrame(pd.date_range(df['datetime'].min(), df['datetime'].max(), freq='5 min').rename('datetime'))
Use the pandas merge_ordered function to left-join df onto df2, then forward-fill the gaps:
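# Left join keeps every 5-minute slot from df2; slots missing in df get NaN.
result = pd.merge_ordered(df2, df, on='datetime', how='left')
# Rebuild date and time from the full datetime grid.
result['date'] = result['datetime'].dt.date
result['time'] = result['datetime'].dt.time
# Carry the last known value forward into the inserted rows.
result['val'] = result['val'].ffill()
result = result.drop('datetime', axis=1)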
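Since the question starts from a CSV file rather than a dict, here is a minimal sketch of the same merge_ordered idea applied to CSV input. It assumes the file has no header row and three columns; the column names, the sample text, and the helper name fill_five_minute_gaps are illustrative, not taken from the question's file.

```python
from io import StringIO

import pandas as pd

# Sample rows copied from the question; assumed to have no header line.
csv_text = """2015-01-02,09:30:00,64.815
2015-01-02,09:35:00,64.8741
2015-01-02,09:55:00,65.0255
2015-01-02,10:00:00,64.9269
"""


def fill_five_minute_gaps(csv_source):
    """Hypothetical helper: read date,time,val rows and fill 5-minute gaps."""
    df = pd.read_csv(csv_source, header=None, names=["date", "time", "val"])
    df["datetime"] = pd.to_datetime(df["date"] + " " + df["time"])

    # Regular 5-minute grid from the first to the last observed timestamp.
    grid = pd.DataFrame({
        "datetime": pd.date_range(df["datetime"].min(),
                                  df["datetime"].max(), freq="5min")
    })

    # Left join keeps every grid slot; missing slots get NaN, then ffill.
    result = pd.merge_ordered(grid, df[["datetime", "val"]],
                              on="datetime", how="left")
    result["val"] = result["val"].ffill()

    # Format date and time back into the string layout of the input.
    result["date"] = result["datetime"].dt.strftime("%Y-%m-%d")
    result["time"] = result["datetime"].dt.strftime("%H:%M:%S")
    return result[["date", "time", "val"]]


filled = fill_five_minute_gaps(StringIO(csv_text))
print(filled.to_csv(index=False, header=False))
```

To use it on a real file, pass the file path instead of the StringIO object. If the file does have a header row, header=None would need to change accordingly.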
Upvotes: 0
Reputation: 7789
I see there is an extra row in the expected output: 2015-01-02,10:05:00,64.815.
To accommodate that as well, you can reindex using pd.date_range.
import pandas as pd

data = {
    'date': ['2015-01-02', '2015-01-02', '2015-01-02', '2015-01-02'],
    'time': ['09:30:00', '09:35:00', '09:55:00', '10:00:00'],
    'val': [64.815, 64.8741, 65.0255, 64.9269]
}
df = pd.DataFrame(data)
df['datetime'] = pd.to_datetime(df['date'] + ' ' + df['time'])
df = df.set_index('datetime')
# Put the rows on a 5-minute grid, extend the grid to 10:05, and forward-fill.
df = df.resample('5min').asfreq().reindex(pd.date_range('2015-01-02 09:30:00', '2015-01-02 10:05:00', freq='5min')).ffill()
# Rebuild the date and time strings from the new index.
df[['date', 'time']] = df.index.astype(str).to_series().str.split(' ', expand=True).values
df = df.reset_index(drop=True)
print(df)
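This gives the rows of the expected output, including 10:05:00, although the forward fill puts 64.9269 in that last row rather than the 64.815 shown in the question: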
         date      time      val
0  2015-01-02  09:30:00  64.8150
1  2015-01-02  09:35:00  64.8741
2  2015-01-02  09:40:00  64.8741
3  2015-01-02  09:45:00  64.8741
4  2015-01-02  09:50:00  64.8741
5  2015-01-02  09:55:00  65.0255
6  2015-01-02  10:00:00  64.9269
7  2015-01-02  10:05:00  64.9269
However, if that was a typo and you don't want the last row, you can do this instead:
# Start from the original df indexed by 'datetime' (before the 10:05 reindex).
df = df.resample('5min').asfreq().reindex(pd.date_range(df.index[0], df.index[-1], freq='5min')).ffill()
df[['date', 'time']] = df.index.astype(str).to_series().str.split(' ', expand=True).values
df = df.reset_index(drop=True)
print(df)
which gives:
         date      time      val
0  2015-01-02  09:30:00  64.8150
1  2015-01-02  09:35:00  64.8741
2  2015-01-02  09:40:00  64.8741
3  2015-01-02  09:45:00  64.8741
4  2015-01-02  09:50:00  64.8741
5  2015-01-02  09:55:00  65.0255
6  2015-01-02  10:00:00  64.9269
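When the grid should only span the first to the last observed timestamp, the asfreq/reindex chain can probably be shortened, because upsampling with resample already builds that grid. The following is a sketch of that simplification, not the method used above; it assumes a single trading day, and the variable name filled is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["2015-01-02", "2015-01-02", "2015-01-02", "2015-01-02"],
    "time": ["09:30:00", "09:35:00", "09:55:00", "10:00:00"],
    "val": [64.815, 64.8741, 65.0255, 64.9269],
})
df["datetime"] = pd.to_datetime(df["date"] + " " + df["time"])

# Upsample to a regular 5-minute grid and carry the last known value forward.
filled = (
    df.set_index("datetime")["val"]
      .resample("5min")
      .ffill()
      .reset_index()
)

# Rebuild the original string columns from the datetime grid.
filled["date"] = filled["datetime"].dt.strftime("%Y-%m-%d")
filled["time"] = filled["datetime"].dt.strftime("%H:%M:%S")
filled = filled[["date", "time", "val"]]
print(filled)
```

With several days in one file, resampling the whole index would also fill the overnight gap between days, so this would likely need to be applied per date.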
Upvotes: 1