Reputation: 210
As the title says, how do I read from stdin or from a file word by word, rather than line by line? I'm dealing with very large files, not guaranteed to have any newlines, so I'd rather not load all of a file into memory. So the standard solution of:
import sys

for line in sys.stdin:
    for word in line.split():
        foo(word)
won't work, since line may be too large. Even if it's not too large, it's still inefficient since I don't need the entire line at once. I essentially just need to look at a single word at a time, and then forget it and move on to the next one, until EOF.
Upvotes: 0
Views: 687
Reputation: 210
Here's a straightforward answer:
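word = ''
with open('filename', 'r') as f:
    while (c := f.read(1)):  # Read one character at a time until EOF
        if c.isspace():
            if word:
                print(word)  # Here you can do whatever you want e.g. append to list
                word = ''
        else:
            word += c
if word:  # The file did not end with whitespace, so emit the last word
    print(word)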
I will note that it would be faster to read larger byte-chunks at a time and detect words after the fact. Ben Y's answer has an (as of this edit) incomplete solution that might be of assistance. If performance (rather than memory, as was my issue) is a problem, that should probably be your approach. The code will be quite a bit longer, however.
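For reference, here is a minimal sketch of that chunked approach. It is not taken from either answer: the name iter_words, the 64 KiB default chunk size, and the UTF-8 decoding are all assumptions made for illustration. It reads from any binary stream, so it should also work with sys.stdin.buffer. Splitting on ASCII whitespace at the byte level is safe for UTF-8 because no byte of a multi-byte character is an ASCII whitespace byte.

```python
import io


def iter_words(stream, chunk_size=64 * 1024):
    """Yield whitespace-separated words from a binary stream, one at a time.

    Hypothetical helper for illustration. Memory use is bounded by roughly
    chunk_size plus the length of the longest word.
    """
    tail = b""  # partial word carried over from the previous chunk
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # empty read means EOF
            if tail:
                yield tail.decode("utf-8")  # assumption: input is UTF-8
            return
        chunk = tail + chunk
        words = chunk.split()  # bytes.split() splits on runs of ASCII whitespace
        if chunk[-1:].isspace():
            tail = b""  # chunk ended on whitespace, so every word is complete
        else:
            tail = words.pop()  # last word may continue in the next chunk
        for w in words:
            yield w.decode("utf-8")


# Demo on an in-memory stream with a deliberately tiny chunk size.
demo = io.BytesIO(b"  alpha beta\tgamma\n\ndelta")
print(list(iter_words(demo, chunk_size=4)))
```

To read standard input instead, pass sys.stdin.buffer (after import sys) in place of the BytesIO object. Text is decoded only once a word is complete, so a chunk boundary that falls in the middle of a multi-byte character does not cause a decoding error.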
Upvotes: 0
Reputation: 1023
Here's a generator approach. I don't know when you plan to stop reading, so the generator keeps yielding words until it reaches the end of the file.
def read_by_word(filename, chunk_size=16):
    '''This generator function opens a file and yields it one word at a time'''
    buff = ''  # Partial word preserved from the previous chunk
    with open(filename) as fd:
        while True:
            chunk = fd.read(chunk_size)
            if not chunk:  # Empty means end of file
                if buff:  # Corner case -- file had no whitespace at end
                    yield buff
                break
            chunk = buff + chunk  # Add any previous reads
            words = chunk.split()
            if chunk[-1].isspace():
                buff = ''  # Chunk ends with whitespace, so every word is complete
            else:
                buff = words.pop()  # Last word may continue in the next chunk
            yield from words

for word in read_by_word('huge_file_with_few_newlines.txt'):
    print(word)
Upvotes: 1