user1456524

Reputation: 11

python, seek, tell, read. Reading lines from giant csv file

I have a giant file (1.2GB) of feature vectors saved as a CSV file. To go through the lines, I've created a Python class that loads batches of rows from the giant file into memory, one batch at a time.

For this class to know exactly where to read in the file to get a batch of batch_size complete rows (let's say batch_size=10,000), the first time it is used on a giant file it goes through the entire file once, records the offset of each line, and saves these offsets to a helper file. Later it can call "file.seek(starting_offset); batch = file.read(num_bytes)" to read the next batch of lines.

First, I implemented the registration of line offsets in this manner:

    offset = 0
    line_offsets = []
    for line in self.fid:
        line_offsets.append(offset)
        offset += len(line)

This worked nicely with giant_file1.

But then I processed these features and created giant_file2 (with normalized features), with the help of this same class. When I then tried to read batches of lines from giant_file2, it failed, because the batch strings it read did not start at line boundaries (for instance, reading something like "-00\n15.467e-04,..." instead of "15.467e-04,...\n").

So I tried changing the line offset calculation part to:

    offset = 0
    line_offsets = []
    while True:
        line = self.fid.readline()

        if not line:
            break

        line_offsets.append(offset)
        offset = self.fid.tell()

The main change is that the offset I register is taken from the result of fid.tell() instead of cumulative lengths of lines.

This version worked well with giant_file2, but failed with giant_file1.

The more I investigated, the more it looked to me as if seek(), tell() and read() are inconsistent with each other. For instance:

    fid = file('giant_file1.csv')
    fid.readline()
    >>>'0.089,169.039,10.375,-30.838,59.171,-50.867,13.968,1.599,-26.718,0.507,-8.967,-8.736,\n'
    fid.tell()
    >>>67L
    fid.readline()
    >>>'15.375,91.43,15.754,-147.691,54.234,54.478,-0.435,32.364,4.64,29.479,4.835,-16.697,\n'
    fid.seek(67)
    fid.tell()
    >>>67L
    fid.readline()
    >>>'507,-8.967,-8.736,\n'

There is a contradiction here: both times fid.tell() reports that I'm positioned at byte 67, yet the first time the next line read is one thing and the second time it is something different.

I can't trust tell() and seek() to put me in the desired location to read from the beginning of the desired line. On the other hand, when I use (with giant_file1) the length of strings as reference for seek() I get the correct position:

    fid.seek(0)
    line = fid.readline()
    fid.tell()
    >>>87L
    len(line)
    >>>86
    fid.seek(86)
    fid.readline()
    >>>'15.375,91.43,15.754,-147.691,54.234,54.478,-0.435,32.364,4.64,29.479,4.835,-16.697,\n'

So what is going on?

The only difference between giant_file1 and giant_file2 that I can think of is that in giant_file1 the values are written in plain decimal notation (e.g. -0.435), and in giant_file2 they are all in scientific notation (e.g. -4.350e-01). I don't think either of them is encoded in Unicode (I assume so because the strings I read with a simple file.read() look readable; how can I make sure?).

I would very much appreciate your help, with explanations, ideas for the cause, and possible solutions (or workarounds).

Thank you, Yonatan.

Upvotes: 1

Views: 5788

Answers (2)

parselmouth

Reputation: 1668

I think you have a newline problem. Check whether giant_file1.csv ends its lines with \n or \r\n. If you open the file in text mode, it returns lines ending in \n only and throws away the redundant \r. So when you look at the length of the returned line, it is 1 less than the number of bytes actually consumed from the file (the full \r\n, not just the \n). These errors accumulate as you read more lines, of course.
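One way to run that check is to count the two kinds of line ending in a sample of the raw bytes. This is only a rough sketch, written for Python 3; the function name count_line_endings and the 1 MB sample size are illustrative choices, not anything from the question.

```python
def count_line_endings(path, sample_bytes=1 << 20):
    """Count CRLF and bare LF line endings in the first sample_bytes of a file.

    Hypothetical helper for illustration; sample_bytes is an arbitrary default.
    """
    with open(path, "rb") as f:        # binary mode: bytes come back untranslated
        chunk = f.read(sample_bytes)
    crlf = chunk.count(b"\r\n")
    lf_only = chunk.count(b"\n") - crlf  # every CRLF also contains one LF
    return crlf, lf_only
```

If the first number is non-zero, the file uses \r\n endings, and line lengths measured in text mode will fall one byte short per line.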

The solution is to open the file in binary mode instead. In this mode there is no \r\n -> \n reduction, so your tally of line lengths will stay consistent with what tell() reports.
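As a minimal sketch of the binary-mode approach, assuming Python 3 (where a file opened with "rb" yields bytes): the names build_line_offsets and read_batch are made up for this example and are not the asker's class.

```python
def build_line_offsets(path):
    """Return the byte offset at which each line of the file starts."""
    offsets = []
    offset = 0
    with open(path, "rb") as f:        # binary mode: no newline translation
        for line in f:
            offsets.append(offset)
            offset += len(line)        # length in bytes, including \r\n or \n
    return offsets


def read_batch(path, offsets, first_line, batch_size):
    """Read batch_size complete lines starting at 0-based index first_line."""
    start = offsets[first_line]
    last = first_line + batch_size
    with open(path, "rb") as f:
        f.seek(start)
        if last < len(offsets):
            data = f.read(offsets[last] - start)
        else:
            data = f.read()            # final batch: read to the end of the file
    return data.splitlines()           # list of bytes objects, endings stripped
```

Because both the index and the later reads use binary mode, the summed lengths are the same byte positions that seek() expects, whichever line ending the file uses.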

I hope that solves it for you - as it's an easy fix. :) Good luck with your project and happy coding!

Upvotes: 2

reptilicus

Reputation: 10407

I had to do something similar in the past and ran into something in the standard library called linecache. You might want to look into that as well.

http://docs.python.org/library/linecache.html
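For reference, a small sketch of what that could look like in Python 3. linecache.getline is the real standard-library call; the wrapper get_rows is a hypothetical name for this example. Note that linecache keeps all lines of the file in memory once it has been read, so it may not suit a 1.2GB file.

```python
import linecache


def get_rows(path, first_line, batch_size):
    """Fetch up to batch_size lines, starting at 1-based line number first_line."""
    rows = []
    for lineno in range(first_line, first_line + batch_size):
        line = linecache.getline(path, lineno)  # returns '' past the end of file
        if not line:
            break
        rows.append(line.rstrip("\r\n"))
    return rows
```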

Upvotes: 0
