Reputation: 143
I have a dictionary with 500 DataFrames in it. Each DataFrame has the columns 'date' and 'num_patients'. I apply the model to all the DataFrames in the dictionary, but the Python kernel crashes because of the large amount of data in the dictionary.
prediction_all = {}
for key, value in dict.items():
    model = Prophet(holidays = holidays).fit(value)
    future = model.make_future_dataframe(periods = 365)
    forecast = model.predict(future)
    prediction_all[key] = forecast.tail()
So I subsetted the dictionary and applied the model to each subset instead.
dict1 = {k: dict[k] for k in sorted(dict.keys())[:50]}
prediction_dict1 = {}
for key, value in dict1.items():
    model = Prophet(holidays = holidays).fit(value)
    future = model.make_future_dataframe(periods = 365)
    forecast = model.predict(future)
    prediction_dict1[key] = forecast.tail()
dict2 = {k: dict[k] for k in sorted(dict.keys())[50:100]}
prediction_dict2 = {}
for key, value in dict2.items():
    model = Prophet(holidays = holidays).fit(value)
    future = model.make_future_dataframe(periods = 365)
    forecast = model.predict(future)
    prediction_dict2[key] = forecast.tail()
But I would need to run the code above 10 times, since I have 500 DataFrames (10 subsets). Is there a more efficient way to do this?
Upvotes: 2
Views: 1002
Reputation: 226336
One immediate improvement is to drop the sorted() and slicing step and replace it with heapq.nsmallest(), which does many fewer comparisons. Also, the .keys() call is not necessary, since dicts iterate over their keys by default.
Replace:
dict1 = {k: dict[k] for k in sorted(dict.keys())[:50]}
dict2 = {k: dict[k] for k in sorted(dict.keys())[50:100]}
With:
lowest_keys = heapq.nsmallest(100, dict)
dict1 = {k : dict[k] for k in lowest_keys[:50]}
dict2 = {k : dict[k] for k in lowest_keys[50:100]}
As for the big for-loops: they do use key (to store each forecast), so .items() is the right call there; switch to .values() only in loops where the key isn't needed.
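If you eventually need forecasts for all 500 DataFrames anyway, the per-subset copies can be folded into a single loop over fixed-size chunks of the sorted keys. Here is a minimal sketch of that idea, not code from the question: it assumes the dictionary of DataFrames is named dfs_by_key (a hypothetical rename of dict to avoid shadowing the builtin), that each DataFrame already has the 'ds'/'y' columns Prophet expects, and that holidays is defined as in the question. Note that chunking by itself doesn't reduce memory use unless you persist and clear the results between chunks.
from prophet import Prophet  # older installs use: from fbprophet import Prophet

CHUNK_SIZE = 50

# dfs_by_key: the question's dict of 500 DataFrames (hypothetical name)
# holidays:   the holidays DataFrame from the question
keys_in_order = sorted(dfs_by_key)  # all keys are needed here, so a full sort is fine
prediction_all = {}

for start in range(0, len(keys_in_order), CHUNK_SIZE):
    chunk = keys_in_order[start:start + CHUNK_SIZE]
    for key in chunk:
        model = Prophet(holidays=holidays).fit(dfs_by_key[key])
        future = model.make_future_dataframe(periods=365)
        forecast = model.predict(future)
        prediction_all[key] = forecast.tail()  # keep only the last rows, as in the question
    # optionally write prediction_all to disk here and clear it to bound memory per chunk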
Upvotes: 3