Skenderbeu Arber

Reputation: 13

Inserting rows in specific location using pandas

I have a CSV-file containing the following data structure:

2015-01-02,09:30:00,64.815
2015-01-02,09:35:00,64.8741
2015-01-02,09:55:00,65.0255
2015-01-02,10:00:00,64.9269

Using pandas in Python, I would like to repeat the 2nd row to fill the missing 5-minute intervals, inserting the copies right after the 2nd row. The result should look like this:

2015-01-02,09:30:00,64.815
2015-01-02,09:35:00,64.8741
2015-01-02,09:40:00,64.8741
2015-01-02,09:45:00,64.8741
2015-01-02,09:50:00,64.8741
2015-01-02,09:55:00,65.0255
2015-01-02,10:00:00,64.9269
2015-01-02,10:05:00,64.815

I have the following code:

    df = pd.read_csv("csv.file", header=0, names=['date', 'minute', 'price'])

    for i in range(len(df)):
        if i != len(df)-1:
            next_i = i+1
            if df.loc[next_i, 'date'] == df.loc[i, 'date'] and df.loc[i, 'minute'] != "16:00:00":
                now = int(df.loc[i, "minute"][:2]+df.loc[i, "minute"][3:5])
                future = int(df.loc[next_i, "minute"][:2]+df.loc[next_i, "minute"][3:5])
                
                while now + 5 != future and df.loc[next_i, "minute"][3:5] != "00" and df.loc[next_i, "minute"][3:5] != "60":
                    newminutes = str(int(df.loc[i, "minute"][3:5])+5*a)
                    newtime = df.loc[next_i, "minute"][:2] +":"+newminutes+":00"
                    df.loc[next_i-0.5] = [df.loc[next_i, 'date'], newtime , df.loc[i, 'price']]
                    df = df.sort_index().reset_index(drop=True)
                    now = int(newtime[:2]+newtime[3:5])
                    future = int(df.loc[next_i+1, "minute"][:2]+df.loc[next_i+1, "minute"][3:5])

However, it's not working.

Upvotes: 1

Views: 250

Answers (2)

Nev1111

Reputation: 1049

Try the pandas merge_ordered function.

Create the original data frame:

    import pandas as pd

    data = {
        'date': ['2015-01-02', '2015-01-02', '2015-01-02', '2015-01-02'],
        'time': ['09:30:00', '09:35:00', '09:55:00', '10:00:00'],
        'val': [64.815, 64.8741, 65.0255, 64.9269]
    }
    df = pd.DataFrame(data)
    # Combine the date and time strings into a single datetime column
    df['datetime'] = pd.to_datetime(df['date'] + ' ' + df['time'])

Create a second data frame, df2, with 5-minute intervals spanning the minimum to the maximum datetime of df:

    df2 = pd.DataFrame(pd.date_range(df['datetime'].min(), df['datetime'].max(), freq='5min').rename('datetime'))

Use the merge_ordered function to join the two, then forward-fill the missing values:

    result = pd.merge_ordered(df2, df, on='datetime', how='left')
    result['date'] = result['datetime'].dt.date
    result['time'] = result['datetime'].dt.time
    result['val'] = result['val'].ffill()
    result = result.drop('datetime', axis=1)
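
Since the question starts from a CSV file rather than a hand-built dictionary, here is a minimal sketch of the same merge_ordered approach applied to CSV text. It assumes the file has no header row (hence header=None; the header=0 used in the question would consume the first data row). The names csv_text, grid, filled and csv_out are illustrative only, and io.StringIO stands in for the real file path.

```python
import io

import pandas as pd

# The four sample rows from the question; the file has no header row.
csv_text = """2015-01-02,09:30:00,64.815
2015-01-02,09:35:00,64.8741
2015-01-02,09:55:00,65.0255
2015-01-02,10:00:00,64.9269
"""

# header=None keeps the first data row as data (assumption: no header line).
df = pd.read_csv(io.StringIO(csv_text), header=None, names=["date", "time", "val"])
df["datetime"] = pd.to_datetime(df["date"] + " " + df["time"])

# Full 5-minute grid from the first to the last observed timestamp.
grid = pd.DataFrame(
    {"datetime": pd.date_range(df["datetime"].min(), df["datetime"].max(), freq="5min")}
)

# Left-join the observations onto the grid and carry the last price forward.
filled = pd.merge_ordered(grid, df[["datetime", "val"]], on="datetime", how="left")
filled["val"] = filled["val"].ffill()

# Rebuild the original text columns and emit CSV without header or index.
filled["date"] = filled["datetime"].dt.strftime("%Y-%m-%d")
filled["time"] = filled["datetime"].dt.strftime("%H:%M:%S")
csv_out = filled[["date", "time", "val"]].to_csv(header=False, index=False)
print(csv_out)
```

Formatting the columns with strftime keeps date and time as plain strings, so the written file has the same layout as the input.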

Upvotes: 0

Himanshu Poddar

Reputation: 7789

I see there is an extra row in the expected output: 2015-01-02,10:05:00,64.815. To accommodate that as well, you can reindex using pd.date_range.

Creating data

    import pandas as pd

    data = {
        'date': ['2015-01-02', '2015-01-02', '2015-01-02', '2015-01-02'],
        'time': ['09:30:00', '09:35:00', '09:55:00', '10:00:00'],
        'val': [64.815, 64.8741, 65.0255, 64.9269]
    }
    df = pd.DataFrame(data)

Creating datetime column for reindexing

    df['datetime'] = pd.to_datetime(df['date'] + ' ' + df['time'])
    df.set_index('datetime', inplace=True)

Generating output

    df = df.resample('5min').asfreq().reindex(pd.date_range('2015-01-02 09:30:00', '2015-01-02 10:05:00', freq='5min')).ffill()
    df[['date', 'time']] = df.index.astype(str).to_series().str.split(' ', expand=True).values
    df.reset_index(drop=True)

Output

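This gives the expected rows. Note that the final 10:05:00 row carries forward the last observed value (64.9269), not the 64.815 shown in the question: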

         date      time      val
0  2015-01-02  09:30:00  64.8150
1  2015-01-02  09:35:00  64.8741
2  2015-01-02  09:40:00  64.8741
3  2015-01-02  09:45:00  64.8741
4  2015-01-02  09:50:00  64.8741
5  2015-01-02  09:55:00  65.0255
6  2015-01-02  10:00:00  64.9269
7  2015-01-02  10:05:00  64.9269

However, if that was a typo and you don't want the last row, you can do this instead:

    df = df.resample('5min').asfreq().reindex(pd.date_range(df.index[0], df.index[-1], freq='5min')).ffill()
    df[['date', 'time']] = df.index.astype(str).to_series().str.split(' ', expand=True).values
    df.reset_index(drop=True)

which gives us:

         date      time      val
0  2015-01-02  09:30:00  64.8150
1  2015-01-02  09:35:00  64.8741
2  2015-01-02  09:40:00  64.8741
3  2015-01-02  09:45:00  64.8741
4  2015-01-02  09:50:00  64.8741
5  2015-01-02  09:55:00  65.0255
6  2015-01-02  10:00:00  64.9269
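
The loop in the question only fills gaps between rows that share the same date, which suggests the real file spans several trading days. Resampling the whole frame at once would also create rows overnight, so one possible variation is to resample each date separately. The sketch below assumes that reading of the question; the second day's rows are hypothetical illustration data, and the names parts and filled are not from the original posts.

```python
import pandas as pd

# First three rows are from the question; the 2015-01-05 rows are hypothetical.
data = {
    "date": ["2015-01-02", "2015-01-02", "2015-01-02", "2015-01-05", "2015-01-05"],
    "time": ["09:30:00", "09:35:00", "09:55:00", "09:30:00", "09:40:00"],
    "val": [64.815, 64.8741, 65.0255, 66.0, 66.5],
}
df = pd.DataFrame(data)
df.index = pd.to_datetime(df["date"] + " " + df["time"])

# Resample each calendar day on its own so no rows are created overnight;
# within a day, missing 5-minute slots take the previous observed value.
parts = [day["val"].resample("5min").ffill() for _, day in df.groupby("date")]
filled = pd.concat(parts).to_frame()

# Rebuild the date and time text columns from the datetime index.
filled["date"] = filled.index.strftime("%Y-%m-%d")
filled["time"] = filled.index.strftime("%H:%M:%S")
filled = filled.reset_index(drop=True)[["date", "time", "val"]]
print(filled)
```

Each day is filled only between its own first and last observation, so padding out to a fixed closing time would still need an explicit reindex per day.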

Upvotes: 1
