Miguel Lambelho
Miguel Lambelho

Reputation: 334

Expanding dataframe with specific frequency - Python

I have a dataframe with the following columns (just an excerpt):

     START                    END               FREQ     VARIABLE    
'2017-03-26 16:55:00'  '2017-10-28 16:55:00'   1234567      x
'2017-03-26 20:35:00'  '2017-10-28 20:35:00'   1234567      y
'2017-03-26 14:55:00'  '2017-10-28 14:55:00'   ..3.567      y
'2017-03-26 11:15:00'  '2017-10-28 11:15:00'   1234567      y
'2017-03-26 09:30:00'  '2017-06-11 09:30:00'   ......7      x

My goal is to create a new dataframe that expands this one by producing daily rows that start on the 'START' date and finish on the 'END' date, according to the 'FREQ' column. In this 'FREQ' column, 1 = Monday, 7 = Sunday. A 'dot' denotes that the row should not be created in that specific day of the week. So, ..3.5.7 will only correspond to 3 new rows on Wednesday, Friday and Sunday only. The 'VARIABLE' column should always have the same value for each of the created rows.

My major problem is that the new dataframe will have millions of rows, therefore, I was looking for a really efficient solution.

The dataframe written in Python code:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.array([ 
'2017-03-26 16:55:00','2017-10-28 16:55:00', '1234567', 'x',
'2017-03-26 20:35:00','2017-10-28 20:35:00','1234567','y',
'2017-03-26 14:55:00','2017-10-28 14:55:00','..3.567','y',
'2017-03-26 11:15:00','2017-10-28 11:15:00','1234567','y',
'2017-03-26 09:30:00','2017-06-11 09:30:00','......7','x']).reshape((5, 4)))
df.columns = ['START','END','FREQ','VARIABLE']

Upvotes: 2

Views: 495

Answers (2)

b2002
b2002

Reputation: 914

Revised answer:

This uses pandas iloc and numpy repeat to make a new dataframe from the original dataframe index, but after determining the repeat indexes based on the date ranges and valid weekdays.

import pandas as pd
import numpy as np

df_arr = np.array([ 
    '2017-03-26 16:55:00', '2017-10-28 16:55:00', '1234567', 'x',
    '2017-03-26 20:35:00', '2017-10-28 20:35:00', '1234567', 'y',
    '2017-03-26 14:55:00', '2017-10-28 14:55:00', '..3.567', 'y',
    '2017-03-26 11:15:00', '2017-10-28 11:15:00', '1234567', 'y',
    '2017-03-26 09:30:00', '2017-06-11 09:30:00', '......7',' x'])

df = pd.DataFrame(df_arr.reshape(5, 4),
                  columns=['START', 'END', 'FREQ', 'VARIABLE'])

def get_weekdays_dates_repeats(start, end, valid_weekday_nums):
    date_range = pd.date_range(start, end, freq="D", normalize=True)
    all_day_nums = date_range.dayofweek.values + 1
    filtered_idx = np.where(np.isin(all_day_nums, valid_weekday_nums))
    day_nums = all_day_nums[filtered_idx]
    dates = date_range[filtered_idx]
    return day_nums, dates.values.astype('<M8[D]'), day_nums.size

starts = df.START.values
ends = df.END.values
freqs = df.FREQ.str.replace('.', '').values

repeats = np.zeros(len(df))
weekdays_arr_list = []
dates_arr_list = []
for i in range(len(df)):
    valid_day_nums = [int(s) for s in list(freqs[i])]
    days, dates, repeat = \
        get_weekdays_dates_repeats(starts[i], ends[i], valid_day_nums)
    weekdays_arr_list.append(days)
    dates_arr_list.append(dates)
    repeats[i] = repeat

weekday_col = np.concatenate(weekdays_arr_list)
dates_col = np.concatenate(dates_arr_list)
repeats = repeats.astype(int)

df2 = df.iloc[np.repeat(df.index.values, repeats)].reset_index(drop=True)

df2['day_num'] = weekday_col
df2['date'] = dates_col

df2.head()

                  START        END          FREQ    VARIABLE    day_num date
0   2017-03-26 16:55:00 2017-10-28 16:55:00 1234567 x   7   2017-03-26
1   2017-03-26 16:55:00 2017-10-28 16:55:00 1234567 x   1   2017-03-27
2   2017-03-26 16:55:00 2017-10-28 16:55:00 1234567 x   2   2017-03-28
3   2017-03-26 16:55:00 2017-10-28 16:55:00 1234567 x   3   2017-03-29
4   2017-03-26 16:55:00 2017-10-28 16:55:00 1234567 x   4   2017-03-30

df2.tail()

                  START                END  FREQ    VARIABLE    day_num date
782 2017-03-26 09:30:00 2017-06-11 09:30:00 ......7 x   7   2017-05-14
783 2017-03-26 09:30:00 2017-06-11 09:30:00 ......7 x   7   2017-05-21
784 2017-03-26 09:30:00 2017-06-11 09:30:00 ......7 x   7   2017-05-28
785 2017-03-26 09:30:00 2017-06-11 09:30:00 ......7 x   7   2017-06-04
786 2017-03-26 09:30:00 2017-06-11 09:30:00 ......7 x   7   2017-06-11

Upvotes: 1

Chris Adams
Chris Adams

Reputation: 18647

Here is one potential way to achieve this using DataFrame.iterrows():

chunks = []
for _, row in df.iterrows():
    freq = [int(i)-1 for i in str(row[2]).replace('.', '')]
    dateidx = [d for d in pd.date_range(row[0], row[1], freq='D') if d.weekday() in freq]
    chunks.append(pd.DataFrame({'date': dateidx}).assign(variable=row[3]))
df_expanded = pd.concat(chunks, ignore_index=True)

Upvotes: 1

Related Questions