Kelly

Reputation: 27

iterating through a list of dictionaries makes my function unusably slow

My function hourlycounts(), which I am running on a dataset of 8000+ records, is prohibitively slow: it runs for 30 minutes before I give up. I am looking for a more efficient way to write this function.

Here is my reprex with a small (n=3), simplified set of data. The problem I run into when asking for help is that with this small amount of data the function works fine, but of course I can't post my entire dataset here or the tokenized API.

I think that hourlycounts() is inefficient because it traverses every dictionary in testdata on every call. I just can't think of a better way to get what I need from it. In the real dataset there is one dictionary for every hour of 2020, and I eventually want to be able to aggregate the data by date.

import datetime
import pandas as pd
import time

testdata = [{'Date': datetime.date(2020, 6, 1),
  'Time': datetime.time(0, 0),
  'Total': '5',
  'Year': 2020},
 {'Date': datetime.date(2020, 6, 1),
  'Time': datetime.time(1, 0),
  'Total': '7',
  'Year': 2020},
 {'Date': datetime.date(2020, 6, 2),
  'Time': datetime.time(3, 0),
  'Total': '1',
  'Year': 2020}]

def timestamp(mmddyyyy):
    return pd.to_datetime(mmddyyyy)

def hourlycounts(date, testlist): #date must be entered as 'mm/dd/yy', including quotes
    countlist=[]
    date_filtered = list(record for record in testlist if record['Date']==timestamp(date))
    list(map(lambda record: countlist.append(record['Total']), date_filtered))
    return countlist

hourlycounts('06/01/2020', testdata)

How can I make hourlycounts() more efficient? Thank you for helping a beginner!

Upvotes: 0

Views: 604

Answers (2)

Mark Ransom

Reputation: 308140

The problem is almost certainly in the part of the code you didn't share, the part that calls hourlycounts. Presumably it is called once for each date of interest, and each time it's called it is forced to run over the entire set of input data.

You will do much better if you sort the data by date, then use itertools.groupby to get all the entries for each date. You can store those in a list, or in another dictionary that is keyed by date.

import itertools

def keyfunc(rec):
    # Group/sort key: the record's date.
    return rec['Date']

def dailycounts_as_list(testlist):
    # Sort by date, then collect each run of consecutive same-date records into a list.
    return [list(g) for k, g in itertools.groupby(sorted(testlist, key=keyfunc), key=keyfunc)]

def dailycounts_as_dict(testlist):
    # Same grouping, but keyed by date so each day can be looked up directly.
    return {k: list(g) for k, g in itertools.groupby(sorted(testlist, key=keyfunc), key=keyfunc)}

In the comments @ggorlen suggests a different approach that also creates a dictionary, but would be faster.

import collections

def dailycounts(testlist):
    # Single pass, no sort needed: append each record to the list kept for its date.
    result = collections.defaultdict(list)
    for rec in testlist:
        result[rec['Date']].append(rec)
    return result

Each member of the returned dictionary is a list of all the records corresponding to that date.
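For example (a quick sketch reusing the testdata from the question), pulling the hourly totals for one date then becomes a plain dictionary lookup instead of a scan of the whole list:

import datetime

counts_by_date = dailycounts(testdata)
june_first = datetime.date(2020, 6, 1)

# One O(1) lookup, then a small list comprehension over just that day's records.
print([rec['Total'] for rec in counts_by_date[june_first]])  # ['5', '7']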

Upvotes: 0

ggorlen

Reputation: 56895

There are several micro-issues with the code that might not matter in the big picture but are well worth considering regardless, and one big-picture organizational problem whose fix might be prohibitive for your use case.


Micro issues

Don't abuse generators as list comps

A generator expression wrapped in list() is slower than the equivalent list comprehension when you're exhausting the entire iterable anyway.

Change

date_filtered = list(record for record in testlist if record['Date']==timestamp(date))

to

date_filtered = [record for record in testlist if record['Date'] == timestamp(date)]
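As a rough illustration (timings vary by machine and Python version; this is not the OP's data, just a synthetic list), a quick timeit comparison shows the gap:

import timeit

data = list(range(100_000))

# Exhausting a generator expression via list() vs. building the list directly.
gen_time = timeit.timeit(lambda: list(x for x in data), number=100)
comp_time = timeit.timeit(lambda: [x for x in data], number=100)

print(f"generator -> list: {gen_time:.3f}s")
print(f"list comprehension: {comp_time:.3f}s")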

Cache unchanging function calls even if they're cheap

Here, even though timestamp(date) never changes, it's computed again and again for every record; that's 2n extra function calls (timestamp plus the pd.to_datetime it wraps, once per record) incurred for no reason.

Change

date_filtered = [record for record in testlist if record['Date'] == timestamp(date)]

to

target_date = timestamp(date)
date_filtered = [record for record in testlist if record['Date'] == target_date]

Don't abuse map for side effects

The following code abuses map for its side effect of mutating countlist, incurring extra object allocation and function call overhead for no reason:

countlist=[]
list(map(lambda record: countlist.append(record['Total']), date_filtered))
return countlist

Better/faster/cleaner:

return [record['Total'] for record in date_filtered]

Do multiple operations in one pass

After these optimizations, the whole function becomes:

def hourly_counts(date, data):
    """ date must be entered as 'mm/dd/yy', including quotes """
    target_date = timestamp(date)
    date_filtered = [x for x in data if x['Date'] == target_date]
    return [x['Total'] for x in date_filtered]

It's easy to see we can combine the final two list comps:

def hourly_counts(date, data):
    """ date must be entered as 'mm/dd/yy', including quotes """
    target_date = timestamp(date)
    return [x['Total'] for x in data if x['Date'] == target_date]

Not only should this be faster by a fairly substantial constant factor, it's certainly much cleaner.


Macro issue

All of the above optimizations aren't going to change the fundamental design situation, which is that you have a list of dicts. Whenever you have a list and you want to find an arbitrary record or perform a filtering operation, the best you can do is O(n). As such, if you want any hope of a speedup, you'll need to reorganize your data to use a dictionary as the outermost structure keyed by the property you care about, in this case, target_date, or the result of timestamp(date). This gives you O(1) lookup.

However, performing this transformation is O(n) at minimum, so if the script can't cache the data and perform repeated queries to make up for the up-front cost of transforming the data to an O(1) lookup on target_date, your only hope is to adjust the way the data is delivered to you (from file, API, etc).
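As a minimal sketch of that reorganization (reusing the testdata from the question, and parsing the query date with the standard library so the key type matches the datetime.date objects already stored in the records):

import datetime
from collections import defaultdict

def build_index(records):
    # One O(n) pass: group records under their 'Date' value.
    index = defaultdict(list)
    for rec in records:
        index[rec['Date']].append(rec)
    return index

def hourly_counts_indexed(date_str, index):
    # date_str as 'mm/dd/yyyy'; strptime keeps the key a plain datetime.date.
    target = datetime.datetime.strptime(date_str, '%m/%d/%Y').date()
    return [rec['Total'] for rec in index.get(target, [])]

index = build_index(testdata)                      # pay the O(n) cost once
print(hourly_counts_indexed('06/01/2020', index))  # every query after that is O(1): ['5', '7']

Whether the one-time O(n) build pays off depends entirely on how many queries you run against the same index.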

If the data is always in a list and you're only doing one filter operation, then I don't see much room for optimization other than parallelization using multiprocessing, and that relies on being able to receive the data in some paginated form so that you avoid a bottleneck where a single process winds up requesting everything.
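If you can get the data in pages, a minimal multiprocessing sketch might look like the following (the pagination itself is assumed, not shown, and pages = [testdata] just stands in for real chunks):

import datetime
from multiprocessing import Pool

TARGET = datetime.date(2020, 6, 1)

def counts_for_page(page):
    # Filter one chunk of records in a worker process.
    return [rec['Total'] for rec in page if rec['Date'] == TARGET]

if __name__ == '__main__':
    pages = [testdata]  # in practice: one list per paginated API response
    with Pool() as pool:
        per_page = pool.map(counts_for_page, pages)
    counts = [total for page in per_page for total in page]
    print(counts)  # ['5', '7']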

Another possibility is that the dates can be provided to you in a pre-sorted order based on the target date key, then you can potentially use binary search to locate the range of dates.
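For example, a sketch using the standard bisect module (assuming the records arrive sorted by 'Date', and again reusing testdata from the question):

import bisect
import datetime

sorted_records = sorted(testdata, key=lambda rec: rec['Date'])  # or already sorted on delivery
dates = [rec['Date'] for rec in sorted_records]                 # built once, reused for every query

def hourly_counts_bisect(date_str):
    target = datetime.datetime.strptime(date_str, '%m/%d/%Y').date()
    lo = bisect.bisect_left(dates, target)    # first index holding this date
    hi = bisect.bisect_right(dates, target)   # one past the last index holding this date
    return [rec['Total'] for rec in sorted_records[lo:hi]]

print(hourly_counts_bisect('06/01/2020'))  # ['5', '7']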

Finally, there is a weird situation where you're importing Pandas just to parse a date string, something the datetime module you already import can do (e.g. datetime.datetime.strptime). It's sort of like buying a car and throwing the whole thing out just to keep a tire. If you can get your data as a dataframe, you can potentially take advantage of vectorized Pandas/NumPy operations (but as above, building the df would be O(n) before you can begin benefitting from it, so this would be use-case specific).
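If a dataframe is an option, a rough sketch of the vectorized route (again reusing testdata) could be:

import datetime
import pandas as pd

df = pd.DataFrame(testdata)            # O(n) build, paid once
df['Total'] = df['Total'].astype(int)  # totals arrive as strings

# Totals for a single date, selected with one vectorized comparison:
one_day = df.loc[df['Date'] == datetime.date(2020, 6, 1), 'Total'].tolist()

# Or aggregate every date at once, which matches your eventual goal:
daily_sums = df.groupby('Date')['Total'].sum()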

By the way, 30 minutes for 8000 records strikes me as highly unrealistic, even on an old notebook like mine. It takes about 24 seconds to run OP's original code with 8000 records appended to testdata, and 2 seconds to run my simplified version, including creating all of the records. I don't think a single call to your version of hourlycounts should take anywhere near 30 minutes, so something might be lost in translation here.

Upvotes: 1
