Reputation: 1
If I have a list that is made up of 1MM ids, how would I pull from that list in intervals of 50k?
For example:
cusid = df['customer_id'].unique().tolist()
# len(cusid) is 1,000,500
If I want to pull in chunks, is the below correct for 50k?
cusid = cusid[:50000]         # first 50k ids
cusid = cusid[50000:100001]   # the next 50k of ids
cusid = cusid[100001:150001]  # the next 50k
Are my interval selections correct?
Thanks!
Upvotes: 0
Views: 77
Reputation: 8572
A couple of things to mention:
It seems you're using a "data science" stack for your work, so there's a good chance you have numpy available; please take a look at numpy.array_split. You can compute all the chunks once and rely on NumPy's view machinery, which is most probably a lot faster than converting NumPy arrays into native Python lists.
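A minimal sketch of that route (assuming the cusid list from the question; passing explicit split indices so each chunk is exactly 50k, rather than letting array_split balance the sizes):
import numpy as np

arr = np.asarray(cusid)                            # keep the data in NumPy, avoid Python lists
split_points = np.arange(50000, len(arr), 50000)   # 50000, 100000, ..., 1000000
chunks = np.array_split(arr, split_points)         # each chunk is a view into arr
# -> 21 chunks: twenty of exactly 50,000 ids plus a final one of 500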
An idiomatic Python approach (IMO) would be to leverage iterators + islice:
from itertools import islice

# create an iterator from your array/list; this is a cheap operation
iterator = iter(cusid)

# if you want element-wise operations, you can feed the chunk to loops or functions that accept iterables
# this is really memory-efficient, as you never hold the whole chunk in memory
chunk = islice(iterator, 50000)
s = sum(chunk)

# in case you really need the whole chunk in memory, just turn the islice into a list
chunk = list(islice(iterator, 50000))
last_in_chunk = chunk[-1]

# and you always use the same code to consume the next chunk from your source,
# without maintaining any counters
next_chunk = list(islice(iterator, 50000))
When your iterator is exhausted (there are no values left), you will get empty chunks. When there aren't enough elements left to fill a full chunk, you will get however many remain.
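For instance, a minimal sketch of draining the whole list chunk by chunk this way (process here is a hypothetical placeholder for whatever you do with each chunk, not something from the question):
from itertools import islice

iterator = iter(cusid)
while True:
    chunk = list(islice(iterator, 50000))
    if not chunk:        # empty chunk -> the iterator is exhausted
        break
    process(chunk)       # hypothetical per-chunk handler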
Upvotes: 1
Reputation: 696
cusid2 = [cusid[a:a+50000] for a in range(0, len(cusid), 50000)]
This is a list comprehension: it collects the slice cusid[a:a+50000] for a going from 0 up to len(cusid) with a step of 50k, so a goes up by 50k every iteration. Note that the stop should be len(cusid) rather than a hard-coded 950000: range(0, 950000, 50000) ends at a = 900000, which would silently drop the last 50,500 of your 1,000,500 ids.
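A quick sanity check (a sketch, assuming the 1,000,500-id cusid from the question):
len(cusid2)                                  # 21 chunks
len(cusid2[0]), len(cusid2[-1])              # 50000 and 500
sum(len(c) for c in cusid2) == len(cusid)    # True - no ids dropped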
Upvotes: 1