Reputation: 565
I have a long (many Millions) list of Transactions in the form [account_id, transactiontyp, data], each account has many Transactions. I want to select all transactions for a given short list (approx. 20000) of accounts. Example:
longlist=[['a','t1',5],['a','t1',9],['b','t1',3],['c','t5',8]]
shortlist=['a','c']
The following code does the trick, but is extremly slow:
selection=[sel for sel in longlist if sel[0] in shortlist]
There must be faster ways to accomplish that? I tried
def select_sample(longlist,shortlist):
ret=[]
for elem in longlist:
if elem[0] in shortlist:
ret.append(elem)
return ret
to see how it scales, as expected it is linear in the size of longlist. Longlist is sorted by account_id. I do not have a unique key in longlist to use dictionaries. Is there something like an index-technique, which I could use?
Upvotes: 0
Views: 309
Reputation: 133919
Using set
(shortlist = set(['a', 'c']
) would have a break-even at around 20 accounts. I'd expect it to be at least 2 decades faster if shortlist
has 20k accounts.
If this selection is repeated many times, you'd benefit even more by using a dictionary that maps account name to list of transactions for that specific account.
However, all in all this smells like an XY problem. Surely you should use a RDBMS to manage this data? They have got efficient algorithms to handle exactly these kinds of queries.
Upvotes: 3