Reputation: 334
I have a dataframe with the following columns (just an excerpt):
START END FREQ VARIABLE
'2017-03-26 16:55:00' '2017-10-28 16:55:00' 1234567 x
'2017-03-26 20:35:00' '2017-10-28 20:35:00' 1234567 y
'2017-03-26 14:55:00' '2017-10-28 14:55:00' ..3.567 y
'2017-03-26 11:15:00' '2017-10-28 11:15:00' 1234567 y
'2017-03-26 09:30:00' '2017-06-11 09:30:00' ......7 x
My goal is to create a new dataframe that expands this one by producing daily rows that start on the 'START' date and finish on the 'END' date, according to the 'FREQ' column. In this 'FREQ' column, 1 = Monday, 7 = Sunday. A 'dot' denotes that the row should not be created in that specific day of the week. So, ..3.5.7 will only correspond to 3 new rows on Wednesday, Friday and Sunday only. The 'VARIABLE' column should always have the same value for each of the created rows.
My major problem is that the new dataframe will have millions of rows, therefore, I was looking for a really efficient solution.
The dataframe written in Python code:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.array([
'2017-03-26 16:55:00','2017-10-28 16:55:00', '1234567', 'x',
'2017-03-26 20:35:00','2017-10-28 20:35:00','1234567','y',
'2017-03-26 14:55:00','2017-10-28 14:55:00','..3.567','y',
'2017-03-26 11:15:00','2017-10-28 11:15:00','1234567','y',
'2017-03-26 09:30:00','2017-06-11 09:30:00','......7','x']).reshape((5, 4)))
df.columns = ['START','END','FREQ','VARIABLE']
Upvotes: 2
Views: 495
Reputation: 914
Revised answer:
This uses pandas iloc and numpy repeat to make a new dataframe from the original dataframe index, but after determining the repeat indexes based on the date ranges and valid weekdays.
import pandas as pd
import numpy as np
df_arr = np.array([
'2017-03-26 16:55:00', '2017-10-28 16:55:00', '1234567', 'x',
'2017-03-26 20:35:00', '2017-10-28 20:35:00', '1234567', 'y',
'2017-03-26 14:55:00', '2017-10-28 14:55:00', '..3.567', 'y',
'2017-03-26 11:15:00', '2017-10-28 11:15:00', '1234567', 'y',
'2017-03-26 09:30:00', '2017-06-11 09:30:00', '......7',' x'])
df = pd.DataFrame(df_arr.reshape(5, 4),
columns=['START', 'END', 'FREQ', 'VARIABLE'])
def get_weekdays_dates_repeats(start, end, valid_weekday_nums):
date_range = pd.date_range(start, end, freq="D", normalize=True)
all_day_nums = date_range.dayofweek.values + 1
filtered_idx = np.where(np.isin(all_day_nums, valid_weekday_nums))
day_nums = all_day_nums[filtered_idx]
dates = date_range[filtered_idx]
return day_nums, dates.values.astype('<M8[D]'), day_nums.size
starts = df.START.values
ends = df.END.values
freqs = df.FREQ.str.replace('.', '').values
repeats = np.zeros(len(df))
weekdays_arr_list = []
dates_arr_list = []
for i in range(len(df)):
valid_day_nums = [int(s) for s in list(freqs[i])]
days, dates, repeat = \
get_weekdays_dates_repeats(starts[i], ends[i], valid_day_nums)
weekdays_arr_list.append(days)
dates_arr_list.append(dates)
repeats[i] = repeat
weekday_col = np.concatenate(weekdays_arr_list)
dates_col = np.concatenate(dates_arr_list)
repeats = repeats.astype(int)
df2 = df.iloc[np.repeat(df.index.values, repeats)].reset_index(drop=True)
df2['day_num'] = weekday_col
df2['date'] = dates_col
df2.head()
START END FREQ VARIABLE day_num date
0 2017-03-26 16:55:00 2017-10-28 16:55:00 1234567 x 7 2017-03-26
1 2017-03-26 16:55:00 2017-10-28 16:55:00 1234567 x 1 2017-03-27
2 2017-03-26 16:55:00 2017-10-28 16:55:00 1234567 x 2 2017-03-28
3 2017-03-26 16:55:00 2017-10-28 16:55:00 1234567 x 3 2017-03-29
4 2017-03-26 16:55:00 2017-10-28 16:55:00 1234567 x 4 2017-03-30
df2.tail()
START END FREQ VARIABLE day_num date
782 2017-03-26 09:30:00 2017-06-11 09:30:00 ......7 x 7 2017-05-14
783 2017-03-26 09:30:00 2017-06-11 09:30:00 ......7 x 7 2017-05-21
784 2017-03-26 09:30:00 2017-06-11 09:30:00 ......7 x 7 2017-05-28
785 2017-03-26 09:30:00 2017-06-11 09:30:00 ......7 x 7 2017-06-04
786 2017-03-26 09:30:00 2017-06-11 09:30:00 ......7 x 7 2017-06-11
Upvotes: 1
Reputation: 18647
Here is one potential way to achieve this using DataFrame.iterrows()
:
chunks = []
for _, row in df.iterrows():
freq = [int(i)-1 for i in str(row[2]).replace('.', '')]
dateidx = [d for d in pd.date_range(row[0], row[1], freq='D') if d.weekday() in freq]
chunks.append(pd.DataFrame({'date': dateidx}).assign(variable=row[3]))
df_expanded = pd.concat(chunks, ignore_index=True)
Upvotes: 1