Jaime2823

Reputation: 9

How to check if there are duplicate words in a file

I need to return True if there are any duplicate words in the file. This is what I have, but it is not correct.

def duplicate(filename):
    infile = open(filename)
    contents = infile.read()
    infile.close()
    words = contents.split()
    for word in words:
        if words.count(word) > 1:
            return True
        else:
            return False

file contents

This is a file with a duplicate. Just one.
You may try to find another but you'll never see it.

Upvotes: 0

Views: 176

Answers (2)

C.Nivs

Reputation: 13106

Usually a dictionary is nice for this kind of task (I'd suggest using a Counter, but I don't think you're quite there yet).

Dictionaries are great for grouping data, since the keys are unique, and can be really useful for membership testing, since the speed of the test does not, on average, depend on the size of the dict. In this case, you can use the words as keys and their counts as values. Then return True on the first dupe, which it looks like you tried to do:

def has_duplicate(filename):
    # create the dictionary here
    words = {}

    # it is best to use a with statement to open a file
    # that way you don't have to close it
    with open(filename) as infile:
        # you can iterate directly over the file
        for line in infile:
            for word in line.split():

                # if the word is in the dictionary
                # then you've seen it before and it's a duplicate
                if word in words:
                    return True

                # Otherwise, add it
                else:
                    words[word] = 1
    return False

One caveat: this won't handle differences in capitalization or punctuation, so "Word", "word" and "word." all count as different words.
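
If you want to handle the capitalization and punctuation caveat, one option is to normalize each word before the membership test. This is only a minimal sketch, assuming that lowercasing and dropping leading and trailing ASCII punctuation is enough for your data. The names normalize and has_duplicate_normalized are made up for illustration, and a set stands in for the dictionary since only the keys matter:

```python
import string


def normalize(word):
    # Strip leading/trailing ASCII punctuation, then lowercase.
    # Inner characters (the apostrophe in "you'll") are left alone.
    return word.strip(string.punctuation).lower()


def has_duplicate_normalized(filename):
    seen = set()
    with open(filename) as infile:
        for line in infile:
            for raw in line.split():
                word = normalize(raw)
                if not word:
                    # the token was nothing but punctuation
                    continue
                if word in seen:
                    return True
                seen.add(word)
    return False
```

With this version, "Word" and "word!" are treated as the same word.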

Upvotes: 0

OneCricketeer

Reputation: 191854

You're returning on the first word, whether or not it is a duplicate. Don't return False until you have inspected all the words:

for word in words:
    if words.count(word) > 1:
        return True
return False

Also, you're not stripping punctuation, so word! would be counted as a different word from word.

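It's also more efficient to use a collections.Counter object, which tallies every word in a single pass instead of rescanning the whole list for each word.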
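
As a rough sketch of the Counter approach (the function name has_duplicate_counter is just for illustration), assuming the file is small enough to read in one go:

```python
from collections import Counter


def has_duplicate_counter(filename):
    with open(filename) as infile:
        # Tally every whitespace-separated word in the file
        counts = Counter(infile.read().split())
    # any() stops at the first count greater than 1
    return any(n > 1 for n in counts.values())
```

Unlike a loop with an early return, this tallies the whole file before answering, but each word is only counted once rather than once per words.count call.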

Plus, it's better to open the file with a with statement, so it gets closed automatically:

with open(filename) as infile:
    lines = infile.readlines()
    for line in lines:
        for word in line.split():
            ...
return False

Upvotes: 2
