pavlogiannis

Reputation: 303

Iterate through words of a file in Python

I need to iterate through the words of a large file, which consists of a single, very long line. I am aware of methods that iterate through the file line by line; however, they are not applicable in my case because of its single-line structure.

Any alternatives?

Upvotes: 14

Views: 35204

Answers (8)

smac89

Reputation: 43128

I've answered a similar question before, but I have since refined the method used in that answer; here is the updated version (copied from a more recent answer):

Here is my fully functional approach, which avoids having to read and split lines. It makes use of the itertools module:

Note: for Python 3, replace itertools.imap with the built-in map (a Python 3 sketch follows the usage example below).

import itertools

def readwords(mfile):
    # Read the file one character at a time, stopping at EOF (read(1) returns ''),
    # then group consecutive characters by whether they are whitespace.
    byte_stream = itertools.groupby(
        itertools.takewhile(lambda c: bool(c),
            itertools.imap(mfile.read,
                itertools.repeat(1))), str.isspace)

    # Join each run of non-whitespace characters back into a word.
    return ("".join(group) for pred, group in byte_stream if not pred)

Sample usage:

>>> import sys
>>> for w in readwords(sys.stdin):
...     print (w)
... 
I really love this new method of reading words in python
I
really
love
this
new
method
of
reading
words
in
python
           
It's soo very Functional!
It's
soo
very
Functional!
>>>

I guess in your case, this would be the way to use the function:

with open('words.txt', 'r') as f:
    for word in readwords(f):
        print(word)
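
As mentioned in the note above, itertools.imap does not exist in Python 3. A minimal Python 3 sketch of the same idea (the name readwords_py3 is just for illustration, not part of the original answer):

import itertools

def readwords_py3(mfile):
    # Read one character at a time; takewhile stops at EOF, where read(1) returns ''.
    chars = itertools.takewhile(bool, map(mfile.read, itertools.repeat(1)))
    # Group consecutive characters by whether they are whitespace,
    # then join each non-whitespace run back into a word.
    groups = itertools.groupby(chars, str.isspace)
    return ("".join(group) for is_space, group in groups if not is_space)

It can be used exactly like readwords above.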

Upvotes: 1

Vikas

Reputation: 2028

What Donald Miner suggested looks good; simple and short. I used the code below in something I wrote some time ago:

l = []
with open("filename.txt", "rU") as f:
    for line in f:
        for word in line.split():
            l.append(word)

This is the longer version of what Donald Miner suggested.

Upvotes: 0

Andrea Spadaccini

Reputation: 12651

It really depends on your definition of word. But try this:

contents = open("your-filename-here").read()
for word in contents.split():
    # do something with each word
    print(word)

This will use whitespace characters as word boundaries.

Of course, remember to properly open and close the file; this is just a quick example (see the sketch below).
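
For completeness, a minimal sketch of the same approach using a with statement so the file is closed automatically (not part of the original answer):

with open("your-filename-here") as f:
    for word in f.read().split():
        # do something with each word
        print(word)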

Upvotes: 8

laike9m

Reputation: 19348

You really should consider using a generator:

def word_gen(file):
    # Yield words lazily, one at a time.
    for line in file:
        for word in line.split():
            yield word

with open('somefile') as f:
    for word in word_gen(f):
        print(word)

Upvotes: 4

ᅠᅠᅠ

Reputation: 66980

Long long line? I assume the line is too big to reasonably fit in memory, so you want some kind of buffering.

First of all, this is a bad format; if you have any kind of control over the file, make it one word per line.

If not, use something like:

def read_words(input_file):
    # Generator wrapper around the loop so that `yield` is valid.
    line = ''
    while True:
        word, space, line = line.partition(' ')
        if space:
            # A word was found
            yield word
        else:
            # A word was not found; read a chunk of data from the file
            next_chunk = input_file.read(1000)
            if next_chunk:
                # Add the chunk to our line
                line = word + next_chunk
            else:
                # No more data; yield the last word and return
                yield word.rstrip('\n')
                return
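
A possible way to drive the generator above (the wrapper name read_words is just for illustration):

with open('words.txt') as input_file:
    for word in read_words(input_file):
        print(word)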

Upvotes: 6

Arjor

Reputation: 1049

After reading the line you could do:

# `line` is the text read from the file; `pattern` is the word you are looking for.
length = len(pattern)
i = 0
while True:
    i = line.find(pattern, i)
    if i == -1:
        break
    print(line[i:i + length])  # or do whatever
    i += length

Alex.

Upvotes: 0

Donald Miner

Reputation: 39903

There are more efficient ways of doing this, but syntactically, this might be the shortest:

words = open('myfile').read().split()

If memory is a concern, you won't want to do this, because it loads the entire file into memory instead of iterating over it.

Upvotes: 3

user462356

Reputation:

Read in the line as normal, then split it on whitespace to break it down into words?

Something like:

word_list = loaded_string.split()

Upvotes: 0
