ashishsingal

Reputation: 2978

Itertools.combinations performance issue with large lists

I've got a list of tickers in tickers_list and I need to generate all unique pairs of its elements.

Here's my code:

corr_list = []
i = 0
for ticker in tickers_list:
    # pair this ticker with every ticker that comes after it in the list
    tmp_list = tickers_list[i+1:]
    for tl_ticker in tmp_list:
        corr_list.append({'ticker1': ticker, 'ticker2': tl_ticker})
    i = i + 1
len(corr_list)

Here's the code using itertools:

from itertools import combinations

# combinations() yields each unique pair exactly once
new_list = [{'ticker1': t1, 'ticker2': t2} for t1, t2 in combinations(tickers_list, 2)]
len(new_list)

These produce the exact same results. Of course, the itertools code is much more elegant, and both work perfectly for 10, 100, and 1,000 items. The time required is almost identical (I timed it). However, at 4,000 items, my code takes about 12 seconds to run on my machine while the itertools version crashes my computer. The resultant set is only around 8 million rows, so am I doing something wrong, or is this an issue with itertools?
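For reference, here is roughly how I timed the two approaches (a minimal sketch; the loop_pairs and itertools_pairs helper names and the synthetic ticker list are mine):

import timeit
from itertools import combinations

def loop_pairs(tickers):
    # the explicit-loop version from above
    corr_list = []
    i = 0
    for ticker in tickers:
        for tl_ticker in tickers[i+1:]:
            corr_list.append({'ticker1': ticker, 'ticker2': tl_ticker})
        i = i + 1
    return corr_list

def itertools_pairs(tickers):
    # the itertools version from above
    return [{'ticker1': t1, 'ticker2': t2} for t1, t2 in combinations(tickers, 2)]

tickers_list = ['T%04d' % n for n in range(1000)]
print(timeit.timeit(lambda: loop_pairs(tickers_list), number=3))
print(timeit.timeit(lambda: itertools_pairs(tickers_list), number=3))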

Upvotes: 0

Views: 2352

Answers (1)

Hugh Bothwell

Reputation: 56644

from itertools import combinations, product

# 20 * 20 * 20 == 8000 items
tickers_list = ["".join(chars) for chars in product("ABCDEFGHIJKLMNOPQRST", repeat=3)]

# 8000 * 7999 / 2 == 31996000 (just under 32 million) items
unique_pairs = [{"ticker1":t1, "ticker2":t2} for t1,t2 in combinations(tickers_list, 2)]

works fine on my machine (Win7Pro x64, Python 3.4) up to len(tickers_list) == 8000 (taking 38s and consuming almost 76 GiB of RAM to do so!).

Are you running 32-bit or 64-bit Python? Do you actually need to store all combinations, or can you use and discard them as you go?
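If you can use and discard the pairs as you go, a plain loop over the combinations iterator keeps memory use flat, because no list is ever built. A sketch, where handle_pair stands in for whatever per-pair work you actually do:

from itertools import combinations

def handle_pair(t1, t2):
    # placeholder for the real per-pair work (e.g. computing a correlation)
    pass

# nothing is stored: each pair is created, consumed, and discarded
for t1, t2 in combinations(tickers_list, 2):
    handle_pair(t1, t2)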


Edit: I miscalculated the size; it is actually about 9.5 GB. Try it yourself:

from sys import getsizeof as size

# list overhead plus one dict per pair (getsizeof does not count the
# strings inside each dict, but those are shared between pairs anyway)
size(unique_pairs) + len(unique_pairs) * size(unique_pairs[0])

which gives

9479596592  # total size of structure in bytes == 9.5 GB

In any case, this is not a result of using itertools; it is the memory footprint of the resulting list-of-dicts, and it would be the same no matter how you generated that list. Your computer can probably handle it anyway by using virtual memory (swapping to disk), though that is much slower.
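To see where the bytes go, you can check the per-pair overhead directly; a dict costs several times more per pair than, say, a tuple (exact sizes vary by Python version and build, so print them rather than trust my numbers):

from sys import getsizeof

pair_dict = {'ticker1': 'AAA', 'ticker2': 'AAB'}
pair_tuple = ('AAA', 'AAB')

# on a 64-bit CPython the dict is several times larger than the tuple;
# multiplied by ~32 million pairs, that difference alone is gigabytes
print(getsizeof(pair_dict), getsizeof(pair_tuple))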

If you are just doing this to get it into a database, I suggest letting the database do the work: make a table containing the items of tickers_list, then select a cross-join to produce the final table, something like

SELECT a.ticker, b.ticker
FROM
    tickers as a,
    tickers as b
WHERE
    a.ticker < b.ticker
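If it helps, here is a minimal sketch of that approach using the sqlite3 module from the standard library (the table and column names are my own choices, and it assumes the tickers are unique, as in the question):

import sqlite3

conn = sqlite3.connect(':memory:')  # or a file path for an on-disk database
conn.execute('CREATE TABLE tickers (ticker TEXT)')
conn.executemany('INSERT INTO tickers VALUES (?)',
                 [(t,) for t in tickers_list])

# the database streams out each unique pair; nothing is materialized in Python
cursor = conn.execute(
    'SELECT a.ticker, b.ticker FROM tickers AS a, tickers AS b '
    'WHERE a.ticker < b.ticker')
for ticker1, ticker2 in cursor:
    pass  # consume each pair here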

Upvotes: 1
