Reputation: 2837
I am using Python 3.6.4 on Windows 10 with the Fall Creators Update. I am attempting to read an XML file using the following code:
with open('file.xml', 'rt', encoding='utf8') as file:
    for line in file.readline():
        do_something(line)
readline() is returning a single character on each call, not a complete line. The file was produced on Linux, is definitely encoded as UTF-8, has nothing special such as a BOM at the beginning, and has been verified with a hex dump to contain valid data. The line ending is 0x0a since it comes from Linux. I tried specifying -1 as the argument to readline(), which should be the default, without any change in behavior. The file is very large (>240GB), but the problem occurs at the start of the file.
Any suggestions as to what I might be doing wrong?
Upvotes: 5
Views: 8162
Reputation: 31
readline() returns a string containing a single line of the file, while readlines() returns a list in which each item is a line. So for line in file.readline() reads one line and then iterates over that string, which is why you get one character per iteration.
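To see the difference without touching the real file, here is a small sketch. It uses io.StringIO as a stand-in for an open text file, and the two-line content is made up for illustration:

```python
import io

# Stand-in for an open text file (hypothetical two-line content).
fake_file = io.StringIO("<root>\n<item/>\n")

# readline() returns one line as a str; looping over a str yields characters.
chars = [ch for ch in fake_file.readline()]

# Rewind, then iterate over the file object itself: this yields whole lines.
fake_file.seek(0)
lines = [line for line in fake_file]

print(chars)
print(lines)
```

The first list holds the individual characters of the first line, the second holds the complete lines including their trailing newline.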
If you want to iterate over the file and avoid jamming your memory, try this:
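line = '1'  # any non-empty string, so the loop starts
while line:
    line = f.readline()  # f is the open file object
    if not line:  # an empty string means end of file
        break
    do_something(line)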
or:
line = f.readline()
while line:
    do_something(line)
    line = f.readline()
By the way, Beautiful Soup is a useful package for XML parsing.
Upvotes: 2
Reputation: 28580
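readline() will return a single line as a string (which you then iterate over, character by character). You could use readlines() instead, as this will give you a list of lines which your for-loop will iterate over, one line at a time. Note that readlines() reads the whole file into memory, which is not practical for a file of this size.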
Even better, and more efficient:
for line in file:
    do_something(line)
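Here is a self-contained sketch of that pattern. The count_lines helper and the small temporary file are only there so the example runs on its own; in your case you would open file.xml and call do_something(line) inside the loop:

```python
import os
import tempfile


def count_lines(path):
    """Iterate over the file object; only one line is held at a time."""
    count = 0
    with open(path, "rt", encoding="utf8") as file:
        for line in file:  # each iteration yields one complete line
            count += 1
    return count


# Write a tiny sample file with Linux line endings (hypothetical content).
with tempfile.NamedTemporaryFile(
    "w", suffix=".xml", delete=False, encoding="utf8", newline="\n"
) as tmp:
    tmp.write("<root>\n<item/>\n</root>\n")
    sample_path = tmp.name

n_lines = count_lines(sample_path)
os.remove(sample_path)
print(n_lines)
```

Because the file object is read lazily, memory use should stay roughly constant regardless of file size, as long as no single line is enormous.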
Upvotes: 10