Artur

Reputation: 7257

Unable to read huge (20GB) file from CPython

I have a CPython issue that I cannot understand. It boils down to this: the same code that reads a small text file without trouble cannot read even a single line from a 20 GB text file.

Some useful info:

The obvious solution:

f = open(r'filename', 'r')
for line in f:
    print(line)
f.close()

works, but only for the short file. For the big one it hangs forever (or at least far longer than it should take to print the first line).

So I wanted to at least try to read one line like this:

f = open(r'filename', 'r')
print(f.readline())
f.close()

The situation is similar here: it works instantly for the small file, but for the big one, after a substantial amount of time, it produces this message:

Traceback (most recent call last):
  File "***", line 16, in <module>
    print(f.readline())
SystemError: ..\Objects\stringobject.c:3902: bad argument to internal function

How the heck should I read a big text file?

UPDATE:

Turns out a human being thinks more clearly after enough sleep ;-). The problem is solved: I had overlooked one sentence in the documentation:

Python is usually built with universal newlines support; supplying 'U' opens the file as a text file, but lines may be terminated by any of the following: the Unix end-of-line convention '\n', the Macintosh convention '\r', or the Windows convention '\r\n'.

I had simply assumed universal newlines were 'turned on' by default.

My above statement that:

print(f.readline())

was reading just one line was partially false (my bad). Remember I said my small file was created by taking a chunk of the big one? During that operation the line endings changed from CR to CRLF, so for the small file what I saw really was the first line. All of that made me think the problem was not in the line endings.
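For anyone hitting the same thing, a quick way to check which line-ending convention a file uses is to read a small sample in binary mode and count the terminators. This is only a sketch, written for Python 3 (the thread itself uses Python 2); `count_line_endings`, `sample_size`, and the demo file are names made up for illustration, and a CRLF pair cut in half at the end of the sample would be miscounted by one.

```python
import os
import tempfile


def count_line_endings(path, sample_size=64 * 1024):
    """Count CRLF, lone CR, and lone LF in the first sample_size bytes."""
    with open(path, "rb") as fh:  # binary mode: no newline translation
        sample = fh.read(sample_size)
    crlf = sample.count(b"\r\n")
    return {
        "CRLF": crlf,
        "CR": sample.count(b"\r") - crlf,  # CRs that are not part of a CRLF
        "LF": sample.count(b"\n") - crlf,  # LFs that are not part of a CRLF
    }


# Hypothetical demo file with CR-only line endings, like the big file here.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"abcdef\rabcdef\rabcdef\r")
demo_path = tmp.name

counts = count_line_endings(demo_path)
print(counts)
os.remove(demo_path)
```

Because only the first 64 KB are read, this returns immediately even on a 20 GB file, which makes it a cheap first check before trying to iterate over lines.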

Thank you all for time and help.

Upvotes: 2

Views: 430

Answers (2)

beroe

Reputation: 12316

Although your "test" only prints one line, that does not mean it is only reading one line from the file. With a \r-delimited test file, I also get only one line of output. If I read each line in using a for loop, it still prints only one line, and if I call readline() a second time on a multi-line file, it doesn't give any more lines.

Try opening the same file with the 'rU' mode:

f = open('filename', 'rU')

My tests of a file with several lines of \r-delimited text give:

f = open('test.txt','r')  # Opening the "wrong" way
for line in f:
    print line

Output:

abcdef

Then with rU:

f = open('test.txt','rU')
for line in f:
    print line

Output:

abcdef

abcdef

abcdef

abcdef

abcdef

EDIT: In support of Joran's explanation, this test pretty much shows it to be the case that the entire file is loading and the carriage return character is causing over-printing when you see only one line of output...

f = open('test.txt','r')     #  Opening the "wrong" way again
for line in f:
    print "XXX{}YYY".format(line)

Output gets overwritten...

YYYdefdef
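A way to see the hidden carriage returns directly, instead of inferring them from overwritten output, is to print the repr() of what was read. The sketch below is for Python 3, where passing newline="\n" to open() makes it split on "\n" only, which roughly mimics plain 'r' in Python 2; the temporary demo file is made up for illustration.

```python
import os
import tempfile

# Hypothetical demo file: five "lines" separated by bare carriage returns.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"abcdef\r" * 5)
demo_path = tmp.name

# newline="\n": only "\n" ends a line, and "\r" is passed through untranslated.
with open(demo_path, "r", newline="\n") as fh:
    lines = list(fh)

print(len(lines))      # how many lines the reader thinks there are
print(repr(lines[0]))  # repr() makes the embedded "\r" characters visible
os.remove(demo_path)
```

The whole file comes back as a single "line" with the \r characters embedded in it, which is consistent with the overwritten terminal output above.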

Upvotes: 5

Joran Beasley

Reputation: 114008

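import itertools

def my_readline(fh, delim):
    # Read one character at a time until delim; also stop at EOF, where
    # read(1) returns '' forever and iter(..., delim) alone would never end.
    chars = iter(lambda: fh.read(1), '')
    return "".join(itertools.takewhile(lambda c: c != delim, chars))

f = open(some_file)
line = my_readline(f, "\r")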

should work if you can at least get .read(1) to work ... but if that doesn't work I don't know that anything will ... maybe use shell commands to split the file into smaller chunks somehow ... but I suspect beroe's answer is the real answer
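Reading one character per call is likely to be slow on a 20 GB file. A variation on the same idea is to read fixed-size chunks and split them on the delimiter, so only one chunk plus an unfinished record sits in memory at a time. This is a rough Python 3 sketch, not something from the answers above; `iter_delimited` and `chunk_size` are hypothetical names, and with a real file in Python 3 you would probably want open(path, newline="") so that "\r" is not translated before it reaches the splitter.

```python
import io


def iter_delimited(fh, delim="\r", chunk_size=64 * 1024):
    """Yield delim-separated records from fh, reading chunk_size characters at a time."""
    pending = ""
    while True:
        chunk = fh.read(chunk_size)
        if not chunk:  # read() returns '' at end of file
            break
        pending += chunk
        parts = pending.split(delim)
        pending = parts.pop()  # the last piece may be an incomplete record
        for part in parts:
            yield part
    if pending:  # trailing record with no final delimiter
        yield pending


# Tiny in-memory stand-in for the big file; chunk_size=4 forces several reads.
demo = io.StringIO("abc\rdef\rghi")
records = list(iter_delimited(demo, chunk_size=4))
print(records)
```

Since it is a generator, a loop like `for record in iter_delimited(fh): ...` can start processing the first record without waiting for the rest of the file.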

Upvotes: 0
