jramm

Reputation: 6655

Python: Seeking to EOL in file not working

I have this method:

def get_chunksize(path):
    """
    Breaks a file into chunks and yields the chunk sizes.
    Number of chunks equals the number of available cores.
    Ensures that each chunk ends at an EOL.
    """
    size = os.path.getsize(path)
    cores = mp.cpu_count()
    chunksize = size // cores # floor division gives a whole number of bytes

    f = open(path)
    while 1:
        start = f.tell()
        f.seek(chunksize, 1) # Go to the next chunk
        s = f.readline() # Ensure the chunk ends at the end of a line
        yield start, f.tell()-start
        if not s:
            break

It is supposed to break a file into chunks and return the start of the chunk (in bytes) and the chunk size.

Crucially, the end of a chunk should correspond to the end of a line (which is why the f.readline() behaviour is there), but I am finding that my chunks are not seeking to an EOL at all.

The purpose of the method is to then read chunks which can be passed to a csv.reader instance (via StringIO) for further processing.
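For illustration, the consuming side would look something like this (a sketch only; `read_chunk` is a placeholder name, and the file is opened in binary mode so the byte offsets line up exactly):

```python
import csv
import io

def read_chunk(path, start, length):
    """Read one (start, length) byte range and parse it as CSV rows."""
    with open(path, "rb") as f:
        f.seek(start)
        text = f.read(length).decode()
    return list(csv.reader(io.StringIO(text)))
```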

I've been unable to spot anything obviously wrong with the function...any ideas why it is not moving to the EOL?

I came up with this rather clunky alternative:

def line_chunker(path):
    size = os.path.getsize(path)
    cores = mp.cpu_count()
    chunksize = size // cores # floor division gives a whole number of bytes

    f = open(path)

    while True:
        part = f.readlines(chunksize)
        yield csv.reader(StringIO("".join(part)))
        if not part:
            break

This splits the file into chunks, with a csv reader for each chunk, but the last chunk is always empty (??), and having to join the list of strings back together is rather clunky.

Upvotes: 3

Views: 199

Answers (1)

owns

Reputation: 325

if not s:
    break

Instead of looking at s to decide whether you're at the end of the file, check the file position directly against the file size:

if size == f.tell(): break
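Putting that together, the generator would look something like this (a sketch: it opens the file in binary mode, since Python 3 text streams don't allow non-zero relative seeks, and it clamps the end position to the file size so the final chunk isn't over-reported):

```python
import os
import multiprocessing as mp

def get_chunksize(path):
    """Yield (start, length) byte ranges that each end at an EOL."""
    size = os.path.getsize(path)
    chunksize = size // mp.cpu_count()  # whole bytes per chunk

    with open(path, "rb") as f:         # binary: relative seeks are allowed
        while True:
            start = f.tell()
            f.seek(chunksize, 1)        # jump ahead one chunk
            f.readline()                # then advance to the next EOL
            end = min(f.tell(), size)   # don't report past the real EOF
            yield start, end - start
            if end >= size:             # the fix: test the file position
                break
```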

This should fix it. I wouldn't depend on a CSV file having a single record per line, though. I've worked with several CSV files that contain strings with embedded newlines:

first,last,message
sue,ee,hello
bob,builder,"hello,
this is some text
that I entered"
jim,bob,I'm not so creative...

Notice that the 2nd record (bob) spans three lines. csv.reader handles this correctly. If the idea is to do some CPU-intensive work on a CSV, I'd create an array of threads, each with a buffer of n records, and have csv.reader pass a record to each thread using round-robin, skipping a thread if its buffer is full.
Hope this helps - enjoy.
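A minimal sketch of that dispatch scheme (everything here is hypothetical: `dispatch_round_robin` is a made-up name, the `len(record)` call stands in for the real CPU-intensive work, and the buffer sizes are arbitrary):

```python
import queue
import threading

def dispatch_round_robin(records, n_workers=4, buffer_size=100):
    """Hand records to worker threads round-robin, skipping full buffers."""
    buffers = [queue.Queue(maxsize=buffer_size) for _ in range(n_workers)]
    results = queue.Queue()

    def worker(buf):
        while True:
            record = buf.get()
            if record is None:           # sentinel: no more records
                return
            results.put(len(record))     # stand-in for the real CPU work

    threads = [threading.Thread(target=worker, args=(b,)) for b in buffers]
    for t in threads:
        t.start()

    i = 0
    for record in records:
        while True:                      # try buffers in round-robin order
            try:
                buffers[i % n_workers].put_nowait(record)
                i += 1
                break
            except queue.Full:           # this buffer is full, try the next
                i += 1

    for b in buffers:
        b.put(None)                      # tell each worker to finish
    for t in threads:
        t.join()

    out = []
    while not results.empty():
        out.append(results.get_nowait())
    return out
```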

Upvotes: 1
