Jonathan

Reputation: 2837

readline() returns a character at a time

I am using Python 3.6.4 on Windows 10 with the Fall Creators Update. I am attempting to read an XML file using the following code:

with open('file.xml', 'rt', encoding='utf8') as file:
    for line in file.readline():
        do_something(line)

readline() is returning a single character on each call, not a complete line. The file was produced on Linux, is definitely encoded as UTF-8, has nothing special such as a BOM at the beginning, and has been verified with a hex dump to contain valid data. The line ending is 0x0a since it comes from Linux. I tried specifying -1 as the argument to readline(), which should be the default, without any change in behavior. The file is very large (>240 GB), but the problem occurs at the start of the file.

Any suggestions as to what I might be doing wrong?

Upvotes: 5

Views: 8162

Answers (2)

callmebob

Reputation: 31

readline() returns a single line of the file as a string, while readlines() returns a list in which each item is a line. So

for line in file.readline()

iterates over the characters of that one string, which is why you get a single character each time.
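
To see the effect in isolation, here is a minimal sketch that uses an in-memory io.StringIO as a stand-in for the real file (the XML content is made up for illustration):

```python
import io

# Stand-in for the opened file; the content is invented for this example.
file = io.StringIO("<root>\n<item/>\n</root>\n")

first_line = file.readline()   # a single str holding the whole first line
chars = list(first_line)       # iterating over a str yields one character at a time

print(repr(first_line))
print(chars)
```

The loop in the question does the same thing as list(first_line): it walks the characters of the first line and never reads the rest of the file.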

If you want to iterate over the file without loading it all into memory, try this (f is the open file object):

while True:
    line = f.readline()
    if not line:
        # readline() returns an empty string at end of file
        break
    do_something(line)

or:

line = f.readline()
while line:
    do_something(line)
    line = f.readline()
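
A more compact variant of the same idea, as a sketch: the built-in iter() accepts a callable and a sentinel, and keeps calling the callable until it returns the sentinel. Here f is an io.StringIO stand-in with made-up content, and appending to a list takes the place of do_something():

```python
import io

# Stand-in for the open file object f; the content is invented for this example.
f = io.StringIO("<root>\n<item/>\n</root>\n")

lines = []
# iter(callable, sentinel) calls f.readline() until it returns '' (end of file).
for line in iter(f.readline, ''):
    lines.append(line)  # placeholder for do_something(line)

print(lines)
```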

By the way, Beautiful Soup is a useful package for XML parsing.

Upvotes: 2

matli

Reputation: 28580

readline() returns a single line as a string, and your for loop then iterates over the characters of that string. You could use readlines() instead, as this gives you a list of lines that your for loop iterates over one line at a time, but it reads the whole file into memory first, which is not practical for a file of this size.

Even better, and more efficient:

    for line in file:
        do_something(line)
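
Put together with the with block from the question, this might look like the sketch below. The helper name process_lines and the temporary test file are assumptions made for illustration; handler plays the role of do_something():

```python
import os
import tempfile


def process_lines(path, handler):
    """Call handler on each line of a UTF-8 text file, one line at a time."""
    count = 0
    with open(path, "rt", encoding="utf8") as file:
        for line in file:  # the file object yields whole lines lazily
            handler(line)
            count += 1
    return count


# Small self-check on a temporary file with Linux-style (0x0a) line endings.
with tempfile.NamedTemporaryFile(
    "w", encoding="utf8", newline="\n", suffix=".xml", delete=False
) as tmp:
    tmp.write("<root>\n<item/>\n</root>\n")

seen = []
n = process_lines(tmp.name, seen.append)
os.remove(tmp.name)

print(n, seen)
```

Because the file object is iterated directly, only a buffer's worth of data is held in memory at any time, which matters for a file of several hundred gigabytes.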

Upvotes: 10
