godzilla

Reputation: 3125

Find words that appear only once

I am retrieving only the unique words in a file, meaning words that appear exactly once. Here is what I have so far. Is there a better way to achieve this in Python in terms of big-O complexity? Right now I think this is O(n^2).

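def retHapax():
    myMap = {}      # every word seen so far
    uniqueMap = {}  # words seen exactly once so far
    with open("myfile.txt") as file:
        for i in file:
            myList = i.split(' ')
            for j in myList:
                j = j.rstrip()
                if j in myMap:
                    # pop() with a default avoids a KeyError on a third occurrence
                    uniqueMap.pop(j, None)
                else:
                    myMap[j] = 1
                    uniqueMap[j] = 1
    print(uniqueMap)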

Upvotes: 1

Views: 4222

Answers (4)

Jon Clements

Reputation: 142206

I'd go with the collections.Counter approach, but if you only wanted to use sets, then you could do so by:

with open('myfile.txt') as input_file:
    all_words = set()
    dupes = set() 
    for word in (word for line in input_file for word in line.split()):
        if word in all_words:
            dupes.add(word)
        all_words.add(word)

    unique = all_words - dupes

Given an input of:

one two three
two three four
four five six

Has an output of:

{'five', 'one', 'six'}

Upvotes: 3

itzMEonTV

Reputation: 20359

Try this to get the unique words in a file using Counter:

from collections import Counter

with open("myfile.txt") as input_file:
    word_counts = Counter(word for line in input_file for word in line.split())

# List of unique words (words that appear exactly once).
# items() works on both Python 2 and 3; iteritems() is Python 2 only.
print([word for (word, count) in word_counts.items() if count == 1])
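
As a quick check of the Counter approach, here is a minimal sketch that runs without a file on disk. The function name `words_appearing_once` and the in-memory sample text are illustrative assumptions, and the `io.StringIO` usage assumes Python 3.

```python
import io
from collections import Counter


def words_appearing_once(lines):
    """Return the words that occur exactly once across all lines."""
    # One pass to count every whitespace-separated token.
    word_counts = Counter(word for line in lines for word in line.split())
    # One pass over the distinct words to keep those with a count of 1.
    return [word for word, count in word_counts.items() if count == 1]


# Hypothetical sample text standing in for myfile.txt.
sample = io.StringIO("one two three\ntwo three four\nfour five six\n")
print(sorted(words_appearing_once(sample)))  # ['five', 'one', 'six']
```

Both passes do an average constant-time dictionary operation per item, so the work grows roughly linearly with the number of words.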

Upvotes: 2

AChampion

Reputation: 30268

You could slightly modify your logic and remove a word from the unique set on its second occurrence (example using sets instead of dicts):

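words = set()         # words seen two or more times
unique_words = set()  # words seen exactly once so far
with open("myfile.txt") as f:
    for w in (word.strip() for line in f for word in line.split(' ')):
        if w in words:
            continue  # already known to be a duplicate
        if w in unique_words:
            # second occurrence: no longer unique
            unique_words.remove(w)
            words.add(w)
        else:
            unique_words.add(w)
print(unique_words)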
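
The same two-set logic can be wrapped in a function to check the case that is easy to get wrong: a word that appears three or more times must not be re-added to the unique set. This is only a sketch; the name `hapax_single_pass` and the sample lines are made up for illustration.

```python
def hapax_single_pass(lines):
    """Return the set of words seen exactly once, using one pass and two sets."""
    seen_twice = set()  # words with two or more occurrences
    seen_once = set()   # words with exactly one occurrence so far
    for line in lines:
        for w in line.split():
            if w in seen_twice:
                continue  # third or later occurrence: nothing to do
            if w in seen_once:
                # second occurrence: move the word out of the unique set
                seen_once.remove(w)
                seen_twice.add(w)
            else:
                seen_once.add(w)
    return seen_once


# 'a' appears three times, 'b' twice, 'c' and 'd' once each.
print(sorted(hapax_single_pass(["a b a", "a c b", "d"])))  # ['c', 'd']
```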

Upvotes: 1

Padraic Cunningham

Reputation: 180481

If you want to find all unique words and treat foo the same as foo. then you need to strip punctuation:

from collections import Counter
from string import punctuation

with open("myfile.txt") as f:
    word_counts = Counter(word.strip(punctuation) for line in f for word in line.split())

# items() works on both Python 2 and 3; iteritems() is Python 2 only.
print([word for word, count in word_counts.items() if count == 1])

If you want to ignore case, you also need to call line.lower(). Accurately identifying unique words takes more than just splitting the lines on whitespace.
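
A rough sketch of one way to combine lower-casing with a stricter notion of a word. The regular expression is an assumption about what counts as a word (runs of ASCII letters, digits, or apostrophes), not a general-purpose tokenizer, and `hapax_normalized` is a hypothetical name.

```python
import re
from collections import Counter

# Assumption: a "word" is a run of ASCII letters, digits, or apostrophes.
WORD_RE = re.compile(r"[a-z0-9']+")


def hapax_normalized(lines):
    """Return words that occur once, ignoring case and surrounding punctuation."""
    counts = Counter(
        word
        for line in lines
        # Lower-case first so "Foo" and "foo" are counted together.
        for word in WORD_RE.findall(line.lower())
    )
    return [word for word, count in counts.items() if count == 1]


print(sorted(hapax_normalized(["Foo bar, foo.", "Baz bar!"])))  # ['baz']
```

Text with accented letters, hyphenated words, or other scripts would need a different pattern.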

Upvotes: 3
