Reputation: 6655
I have this method:
def get_chunksize(path):
    """
    Breaks a file into chunks and yields the chunk sizes.
    Number of chunks equals the number of available cores.
    Ensures that each chunk ends at an EOL.
    """
    size = os.path.getsize(path)
    cores = mp.cpu_count()
    chunksize = size/cores  # gives truncated integer
    f = open(path)
    while 1:
        start = f.tell()
        f.seek(chunksize, 1)  # Go to the next chunk
        s = f.readline()  # Ensure the chunk ends at the end of a line
        yield start, f.tell()-start
        if not s:
            break
It is supposed to break a file into chunks and return the start of each chunk (in bytes) and the chunk size.
Crucially, the end of a chunk should correspond to the end of a line (which is why the f.readline()
behaviour is there), but I am finding that my chunks are not seeking to an EOL at all.
The purpose of the method is to then read chunks which can be passed to a csv.reader
instance (via StringIO) for further processing.
I've been unable to spot anything obviously wrong with the function... any ideas why it is not moving to the EOL?
I came up with this rather clunky alternative:
def line_chunker(path):
    size = os.path.getsize(path)
    cores = mp.cpu_count()
    chunksize = size/cores  # gives truncated integer
    f = open(path)
    while True:
        part = f.readlines(chunksize)
        yield csv.reader(StringIO("".join(part)))
        if not part:
            break
This splits the file into chunks, with a csv reader for each chunk, but the last chunk is always empty (??), and having to join the list of strings back together is rather clunky.
Upvotes: 3
Views: 199
Reputation: 325
if not s: break
Instead of looking at s
to see whether you're at the end of the file, you should check whether you've actually reached the end of the file by using:
if size == f.tell(): break
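Applied to get_chunksize above, a minimal corrected sketch might look like this (assuming Python; the file is opened in binary mode so the relative seek also works on Python 3, the division is made explicitly integral with //, and the end position is clamped so a seek past EOF can't cause an endless loop):

```python
import os
import multiprocessing as mp

def get_chunksize(path):
    """Yield (start, length) pairs whose ends fall on line boundaries."""
    size = os.path.getsize(path)
    chunksize = size // mp.cpu_count()   # floor division, explicit on Python 3
    with open(path, 'rb') as f:          # binary mode permits relative seeks
        while True:
            start = f.tell()
            f.seek(chunksize, 1)         # jump roughly one chunk forward
            f.readline()                 # advance to the end of that line
            end = min(f.tell(), size)    # clamp in case the seek overshot EOF
            yield start, end - start
            if end == size:              # stop once the whole file is covered
                break
```

Each (start, length) pair can then be read back with f.seek(start); f.read(length) and handed to a csv.reader via StringIO.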
this should fix it. I wouldn't depend on a CSV file having a single record per line, though. I've worked with several CSV files that have strings containing newlines:
first,last,message
sue,ee,hello
bob,builder,"hello,
this is some text
that I entered"
jim,bob,I'm not so creative...
Notice that the 2nd record (bob) spans three lines. csv.reader can handle this. If the idea is to do some CPU-intensive work on a CSV, I'd create an array of threads, each with a buffer of n records, and have the csv.reader pass a record to each thread round-robin, skipping a thread whose buffer is full.
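To see that csv.reader really does cope with the embedded newlines in the sample above, a quick sketch (using io.StringIO, the Python 3 spelling of the StringIO the question imports):

```python
import csv
from io import StringIO  # Python 3; on Python 2 this was the StringIO module

# The sample file from above: bob's quoted field spans three physical lines.
data = (
    'first,last,message\n'
    'sue,ee,hello\n'
    'bob,builder,"hello,\n'
    'this is some text\n'
    'that I entered"\n'
    "jim,bob,I'm not so creative...\n"
)

rows = list(csv.reader(StringIO(data)))
print(len(rows))  # 4 records, even though the text has 6 physical lines
print(rows[2])    # ['bob', 'builder', 'hello,\nthis is some text\nthat I entered']
```

Because the reader tracks quoting, a newline inside a quoted field is part of the field, not a record separator, which is exactly why splitting the file on raw line boundaries can cut a record in half.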
Hope this helps - enjoy.
Upvotes: 1