Reputation: 259

Parsing word list in python

I have a wlist.txt file of about 58k words of the english language, a small excerpt of which looks like this :

aardvark
aardwolf
aaron
aback
abacus
abaft
abalone
abandon
abandoned
abandonment
abandons
abase
abased
abasement

What I would like to do is have a program search through the list and see if a word is contained in the list, and if so print the word. My issue is that the code i have written will constantly return that no, the word is not in the list, when i know for sure that it is. My code looks like this, anybody notice any bugs?

match = 'aardvark'
f = 'wlist.txt'
success = False
try:
    for word in open(f):
        if word == match:
            success = True
            break
except IOError:
    print f, "not found!"
if success:
    print "The word has been found with a value of", word
else:
    print "Word not found"

Thanks in advance everyone!!

Upvotes: 1

Answers (11)

vidiv

Reputation: 59

Guess you can do it using regular expressions(re) in python. Just import library re using import re re.search(pattern,source) or re.findall(pattern,source)

with open('wordlist.txt', 'rU') as infile: for item in infile.readlines():

if re.search(r'^aardvark',item):
    print('word found')
else:
    print('word not found')

Upvotes: 0

Escualo

Reputation: 42072

Your problem is much simpler to solve. You have not realized that you can read the entire list in memory at a very modest cost - your file is less than 1 MB, it fits perfectly well in memory.

The solution to your problem is to read the entire thing into an array and use standard list methods to look for membership:

# this is the only thing you need to get all the words in memory
words = [w.strip() for w in open("words.txt", "rb").readlines()]

# this is the only thing you need to find wether a word is in the list
print 'aaron' in words
# returns 'True'

# now you can go around many times and ask for membership of any word, 
# or any list of words (use a loop) - the array is already in memory 
# and will stay there until you close the program - it's only 1 mega!

It can be argued that my solution is not clever, but I think it is practical - premature optimization is the root of all evil, and by trying to write a clever loop, you are missing out in a perfectly simple approach that given your problem works perfectly well (by the way, the first call to the function takes less than a second for a 60-thousand-word text file, every search is also extremely fast).

~~Notice: you don't need a set (you don't care if a word is repeated - the answer is the same).~~

Don't solve the wrong problem!

PS. A lot of people seem to think that 58k words is "a lot" - it is (58 + average length) kB (if the words are ~10 letters each, that's 580 kB - about half a meg). When I hear people saying that you should not open that in memory, I wonder how do they open their pictures! It is a paradigm that needs to be broken. People will claim "but your program is not robust, because if the list becomes 100 million gizillion lines, it will break", and that's fair in a world where the English language will increase its vocabulary by 10 orders of magnitude. We often forget that general means general for your domain.

Edit: As per @Chinmay comments, using a set over a list has dramatic access consequences. Using a 58K word list, I ran two 1000 access exercises: list, and set (access time in microseconds):

container    min    max   mean
list           3   1646  724.4
set            1     31    1.6

So, as @Chinmay points out, the mean access time is almost three orders of magnitude smaller for a set. This can make a difference if you are accessing the words many times (which you presumably are).

So, I sand corrected and modify the code to:

# create a set of words
words = set(w.strip() for w in open('file.txt').readlines())

# test access using the `in` operator, as :
'aaron' in words
# will return True

My point remains: the solution to this problem is much simpler than creating a class to implement the membership operator.

Upvotes: 1

dting

Reputation: 39287

file = open(f)
words = set( (line.strip() for line in file.readlines()) )
file.close()

if match in words:
    print "The word has been found with a value of", word
else:
    print "Word not found"

Upvotes: 1

Chinmay Kanchi

Reputation: 65893

As others have already said, your problem stems from the fact that the newline characters are part of the words you are reading in. The best way to get rid of these is to use the strip() method of str.

In addition, your code does too much to accomplish a simple task. All you need to do is build a set from your word list and look for the occurrence of your word in the set. A set is far better for this task than a list because checking for the occurrence of an element in a set is much faster. So something like this should work.

try:
    with open('wordlist.txt', 'rU') as infile:
        wordSet = set(line.strip() for line in infile)
except IOError:
       print 'error opening file'

aWord = 'aardvark'

if aWord in wordSet:
    print 'found word', aWord
else:
    print 'word not found'

Note: if aWord in wordSet is so much faster it isn't funny. If you're looking for a word closer to the end of the word list, set is nearly 60000 times faster for a 267000 word list. And it's still marginally faster even if you're looking for the very first word.

Upvotes: 6

Apalala

Reputation: 9224

The problem in the code you posted is that iterating over an open file includes the newline characters. Other answers deal with that issue.

This answer is to point out that the strategy is very inefficient if the search is to be done often.

If the search will be performed a number of times, then it is best to to store the word list as a Trie, which enables O(m) lookups, with m being the length of the searched string, while constructing the Trie has complexity similar to searching the word list for one word. The Trie can be stored to disk (pickled?) for fast retrieval.

Looking up a word against the dictionary using the posted code takes time proportional to the size of the dictionary, which is O(n). Building the Trie is O(n+C), with a big C, which makes it worth it only if there are going to be frequent searches.

I took a look, and the Web says that there are several implementations of Trie in Python ready to try .

Upvotes: 0

Hugh Bothwell

Reputation: 56634

class WordMatcher(object):
    @classmethod
    def fromFile(cls, fname):
        with open(fname) as inf:
            return cls(inf)

    def __init__(self, words):
        super(WordMatcher,self).__init__()
        self.words = set(word.strip().lower() for word in words)

    def __contains__(self, word):
        return word.strip().lower() in self.words

    def goodWords(self, lst):
        _sw = self.words
        for word in lst:
            word = word.strip().lower()
            if word in _sw:
                yield word

wordlist = WordMatcher.fromFile('wordlist.txt')

'abase' in wordlist   # -> True
list(wordlist.goodWords(['Abandon', 'abased\n', 'xyzzy']))  # -> ['abandon','abased']

Upvotes: 0

Whatang

Reputation: 10356

Everyone's given you good advice about how to do this, but do you really need to use python?

grep aardvark wlist.txt

It'll almost certainly destroy any python based solution for speed. fgrep will probably be even quicker.

Upvotes: 3

nesv

Reputation: 812

Here is my very simple suggestion:

wordlist = map(str.strip, open("wlist.txt", "r").readlines())
if word in wordlist:
   print "The word has been found with a value of", word
else:
   print "Word not found"

Upvotes: 1

anijhaw

Reputation: 9402

Here is the code that should work

match = 'aardvark'
    f = 'wlist.txt'
    success = False
    try:
        for word in open(f):
            if word.strip() == match: # Change here 
                success = True
                break
    except IOError:
        print f, "not found!"
    if success:
        print "The word has been found with a value of", word
    else:
        print "Word not found"

Upvotes: 3

srgerg

Reputation: 19329

Try replacing word == match with word[0:-1] == match to remove the newline character at the end of word.

Edit: Alternately, replace word == match with word.rstrip() == match as suggested in this question.

Upvotes: 1

Edward Z. Yang

Reputation: 26742

Iteration on file objects includes newlines.

Upvotes: 1

Parsing word list in python

Answers (11)

Related Questions