Expanding dataframe with specific frequency - Python

Question

I have a dataframe with the following columns (just an excerpt):

     START                    END               FREQ     VARIABLE    
'2017-03-26 16:55:00'  '2017-10-28 16:55:00'   1234567      x
'2017-03-26 20:35:00'  '2017-10-28 20:35:00'   1234567      y
'2017-03-26 14:55:00'  '2017-10-28 14:55:00'   ..3.567      y
'2017-03-26 11:15:00'  '2017-10-28 11:15:00'   1234567      y
'2017-03-26 09:30:00'  '2017-06-11 09:30:00'   ......7      x

My goal is to create a new dataframe that expands this one by producing daily rows that start on the 'START' date and finish on the 'END' date, according to the 'FREQ' column. In this 'FREQ' column, 1 = Monday, 7 = Sunday. A 'dot' denotes that the row should not be created in that specific day of the week. So, ..3.5.7 will only correspond to 3 new rows on Wednesday, Friday and Sunday only. The 'VARIABLE' column should always have the same value for each of the created rows.

My major problem is that the new dataframe will have millions of rows, therefore, I was looking for a really efficient solution.

The dataframe written in Python code:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.array([ 
'2017-03-26 16:55:00','2017-10-28 16:55:00', '1234567', 'x',
'2017-03-26 20:35:00','2017-10-28 20:35:00','1234567','y',
'2017-03-26 14:55:00','2017-10-28 14:55:00','..3.567','y',
'2017-03-26 11:15:00','2017-10-28 11:15:00','1234567','y',
'2017-03-26 09:30:00','2017-06-11 09:30:00','......7','x']).reshape((5, 4)))
df.columns = ['START','END','FREQ','VARIABLE']

b2002 · Accepted Answer

Revised answer:

This uses pandas iloc and numpy repeat to make a new dataframe from the original dataframe index, but after determining the repeat indexes based on the date ranges and valid weekdays.

import pandas as pd
import numpy as np

df_arr = np.array([ 
    '2017-03-26 16:55:00', '2017-10-28 16:55:00', '1234567', 'x',
    '2017-03-26 20:35:00', '2017-10-28 20:35:00', '1234567', 'y',
    '2017-03-26 14:55:00', '2017-10-28 14:55:00', '..3.567', 'y',
    '2017-03-26 11:15:00', '2017-10-28 11:15:00', '1234567', 'y',
    '2017-03-26 09:30:00', '2017-06-11 09:30:00', '......7',' x'])

df = pd.DataFrame(df_arr.reshape(5, 4),
                  columns=['START', 'END', 'FREQ', 'VARIABLE'])

def get_weekdays_dates_repeats(start, end, valid_weekday_nums):
    date_range = pd.date_range(start, end, freq="D", normalize=True)
    all_day_nums = date_range.dayofweek.values + 1
    filtered_idx = np.where(np.isin(all_day_nums, valid_weekday_nums))
    day_nums = all_day_nums[filtered_idx]
    dates = date_range[filtered_idx]
    return day_nums, dates.values.astype('



df2.head()

                  START        END          FREQ    VARIABLE    day_num date
0   2017-03-26 16:55:00 2017-10-28 16:55:00 1234567 x   7   2017-03-26
1   2017-03-26 16:55:00 2017-10-28 16:55:00 1234567 x   1   2017-03-27
2   2017-03-26 16:55:00 2017-10-28 16:55:00 1234567 x   2   2017-03-28
3   2017-03-26 16:55:00 2017-10-28 16:55:00 1234567 x   3   2017-03-29
4   2017-03-26 16:55:00 2017-10-28 16:55:00 1234567 x   4   2017-03-30


df2.tail()

                  START                END  FREQ    VARIABLE    day_num date
782 2017-03-26 09:30:00 2017-06-11 09:30:00 ......7 x   7   2017-05-14
783 2017-03-26 09:30:00 2017-06-11 09:30:00 ......7 x   7   2017-05-21
784 2017-03-26 09:30:00 2017-06-11 09:30:00 ......7 x   7   2017-05-28
785 2017-03-26 09:30:00 2017-06-11 09:30:00 ......7 x   7   2017-06-04
786 2017-03-26 09:30:00 2017-06-11 09:30:00 ......7 x   7   2017-06-11

Expanding dataframe with specific frequency - Python

Answers (2)

Related Questions