Arshad Nawaz
Arshad Nawaz

Reputation: 45

Python script | long running | Need suggestions to optimize

I have written this script to generate a dataset which would contain 15 minute time intervals based on the inputs provided for operational hours for all days of a week for 365 days.

example: Let us say Store 1 opens at 9 AM and closes at 9 PM on all days. That is 12 hours everyday. 12*4 = 48(15 minute periods a day). 48 * 365 = 17520 (15 minute periods for a year).

The sample dataset only contains 5 sites but there are about 9000 sites that this script needs to generate data for.

The script obviously runs for a handful of sites(100) and couple of days(2) but needs to run for sites(9000) and 365 days.

Looking for suggestions to make this run faster. This will be running on a local machine.

input data: https://drive.google.com/open?id=1uLYRUsJ2vM-TIGPvt5RhHDhTq3vr4V2y

output data: https://drive.google.com/open?id=13MZCQXfVDLBLFbbmmVagIJtm6LFDOk_T

Please let me know if I can help with anything more to get this answered.

def datetime_range(start, end, delta):
current = start
while current < end:
    yield current
    current += delta

import pandas as pd
import numpy as np
import cProfile
from datetime import timedelta, date, datetime

#inputs
empty_data = pd.DataFrame(columns=['store','timestamp'])
start_dt = date(2019, 1, 1)
days = 365

data = "input data | attached to the post"
for i in range(days):
for j in range(len(data.store)):
curr_date = start_dt + timedelta(days=i)  
curr_date_year = curr_date.year
curr_date_month = curr_date.month
curr_date_day =  curr_date.day
weekno = curr_date.weekday()
if weekno<5:
  dts = [dt.strftime('%Y-%m-%d %H:%M') for dt in 
  datetime_range(datetime(curr_date_year,curr_date_month,curr_date_day,data['m_f_open_hrs'].iloc[j],data['m_f_open_min'].iloc[j]), datetime(curr_date_year,curr_date_month,curr_date_day, data['m_f_close_hrs'].iloc[j],data['m_f_close_min'].iloc[j]), 
  timedelta(minutes=15))]
  vert = pd.DataFrame(dts,columns = ['timestamp']) 
  vert['store']= data['store'].iloc[j]
  empty_data = pd.concat([vert, empty_data])
elif weekno==5:
  dts = [dt.strftime('%Y-%m-%d %H:%M') for dt in 
  datetime_range(datetime(curr_date_year,curr_date_month,curr_date_day,data['sat_open_hrs'].iloc[j],data['sat_open_min'].iloc[j]), datetime(curr_date_year,curr_date_month,curr_date_day, data['sat_close_hrs'].iloc[j],data['sat_close_min'].iloc[j]), 
  timedelta(minutes=15))]
  vert = pd.DataFrame(dts,columns = ['timestamp'])
  vert['store']= data['store'].iloc[j]
  empty_data = pd.concat([vert, empty_data])
else:
  dts = [dt.strftime('%Y-%m-%d %H:%M') for dt in 
  datetime_range(datetime(curr_date_year,curr_date_month,curr_date_day,data['sun_open_hrs'].iloc[j],data['sun_open_min'].iloc[j]), datetime(curr_date_year,curr_date_month,curr_date_day, data['sun_close_hrs'].iloc[j],data['sun_close_min'].iloc[j]), 
  timedelta(minutes=15))]
  vert = pd.DataFrame(dts,columns = ['timestamp'])
  vert['store']= data['store'].iloc[j]
  empty_data = pd.concat([vert, empty_data])


final_data = empty_data

Upvotes: 0

Views: 57

Answers (1)

mimetist
mimetist

Reputation: 171

I think the most time consuming tasks in your script are the datetime calculations.

You should try to make all of those calculations using UNIX Time. It basically represents time as an integer that counts seconds... so you could take two UNIX dates and see the difference just by doing simple subtraction.

In my opinion you should perform all the operations like that... and when the process has finished you can make all the datetime conversions to a more readable date format.

Other thing that you should change in your script is all the code repetition that is almost identical. It won't improve the performance, but it improves readability, debugging and your skills as a programmer. As a simple example I have refactored some of the code (you probably can do better than what I did, but this is just an example).

def datetime_range(start, end, delta):
    current = start
    while current < end:
        yield current
        current += delta

from datetime import timedelta, date, datetime
import numpy as np
import cProfile
import pandas as pd

# inputs
empty_data = pd.DataFrame(columns=['store', 'timestamp'])
start_dt = date(2019, 1, 1)
days = 365

data = "input data | attached to the post"
for i in range(days):
    for j in range(len(data.store)):
        curr_date = start_dt + timedelta(days=i)
        curr_date_year = curr_date.year
        curr_date_month = curr_date.month
        curr_date_day = curr_date.day

        weekno = curr_date.weekday()

        week_range = 'sun'
        if weekno < 5:
            week_range = 'm_f'
        elif weekno == 5:
            week_range = 'sat'

        first_time = datetime(curr_date_year,curr_date_month,curr_date_day,data[week_range + '_open_hrs'].iloc[j],data[week_range + '_open_min'].iloc[j])
        second_time = datetime(curr_date_year,curr_date_month,curr_date_day, data[week_range + '_close_hrs'].iloc[j],data[week_range + '_close_min'].iloc[j])
        dts = [ dt.strftime('%Y-%m-%d %H:%M') for dt in datetime_range(first_time, second_time, timedelta(minutes=15)) ]

        vert = pd.DataFrame(dts, columns = ['timestamp'])
        vert['store']= data['store'].iloc[j]
        empty_data = pd.concat([vert, empty_data])

final_data = empty_data

Good luck!

Upvotes: 1

Related Questions