Peatherfed

Reputation: 210

Python: How do I read from stdin/file word by word?

As the title says, how do I read from stdin or from a file word by word, rather than line by line? I'm dealing with very large files, not guaranteed to have any newlines, so I'd rather not load all of a file into memory. So the standard solution of:

import sys

for line in sys.stdin:
    for word in line.split():  # split the line on whitespace to get words
        foo(word)

won't work, since line may be too large. Even if it's not too large, it's still inefficient since I don't need the entire line at once. I essentially just need to look at a single word at a time, and then forget it and move on to the next one, until EOF.

Upvotes: 0

Views: 687

Answers (2)

Peatherfed

Reputation: 210

Here's a straightforward answer:

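word = ''
with open('filename', 'r') as f:
    while (c := f.read(1)):  # Read one character at a time; '' signals EOF
        if c.isspace():
            if word:
                print(word)  # Do whatever you want here, e.g. append to a list
            word = ''
        else:
            word += c
if word:  # Flush the last word if the file does not end with whitespace
    print(word)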

Note that it would be faster to read larger chunks at a time and detect the words after the fact. Ben Y's answer sketches that approach and might be of assistance. If performance (rather than memory, as was my issue) is your concern, that should probably be your approach. The code will be quite a bit longer, however.
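For reference, here is a minimal sketch of that chunked variant, written against any text stream so it also covers the stdin case from the question. The names iter_words and chunk_size are my own, and the 64 KiB default is an arbitrary assumption rather than a tuned value.

```python
import io


def iter_words(stream, chunk_size=64 * 1024):
    """Yield whitespace-separated words from a text stream, one at a time.

    iter_words and chunk_size are illustrative names, not a library API.
    """
    tail = ''  # partial word left over from the previous chunk
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # empty read: end of stream
            break
        chunk = tail + chunk
        words = chunk.split()
        # If the chunk ends mid-word, hold the last piece back for the next read.
        tail = words.pop() if not chunk[-1].isspace() else ''
        yield from words
    if tail:  # stream ended without trailing whitespace
        yield tail


# Demo on an in-memory stream with a tiny chunk size to force split words.
demo = io.StringIO('alpha beta\tgamma\n delta')
print(list(iter_words(demo, chunk_size=4)))
```

To read standard input instead, pass sys.stdin (after import sys) as the stream. Memory use is bounded by the chunk size plus the longest single word, so one enormous run of non-whitespace characters would still be buffered in full.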

Upvotes: 0

Ben Y

Reputation: 1023

Here's a generator approach. I don't know when you plan to stop reading, so the loop simply keeps yielding words until it reaches the end of the file.

def read_by_word(filename, chunk_size=16):
    '''This generator function opens a file and yields it one word at a time'''
    buff = ''  # Partial word carried over from the previous chunk
    with open(filename) as fd:
        while True:
            chunk = fd.read(chunk_size)
            if not chunk:  # Empty means end of file
                if buff:  # Corner case -- file had no whitespace at end
                    yield buff
                break
            chunk = buff + chunk  # Prepend the carried-over partial word
            words = chunk.split()
            if chunk[-1].isspace():
                buff = ''  # Chunk ends on whitespace, so every word is complete
            else:
                buff = words.pop()  # Last word may continue in the next chunk
            yield from words


for word in read_by_word('huge_file_with_few_newlines.txt'):
    print(word)
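Since this is a generator, the caller decides when to stop reading. A small sketch of that idea using itertools.islice follows; fake_words is a hypothetical stand-in for read_by_word, used only so the example runs without a file on disk.

```python
from itertools import islice


def fake_words():
    """Hypothetical stand-in for read_by_word: an endless source of words."""
    n = 0
    while True:
        yield f'word{n}'
        n += 1


# islice pulls only the first three items, so the endless loop never runs away.
first_three = list(islice(fake_words(), 3))
print(first_three)
```

The same pattern should work with read_by_word(...) in place of fake_words(), and a plain break inside the for loop stops the reading just as well.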

Upvotes: 1
