M15671

Reputation: 57

How to determine if word in string is a double word?

I want to write a function that takes a filename as a string and returns True if the file contains duplicate words and False otherwise.

So far I have:

def double(filename):
    infile = open(filename, 'r')
    res = False
    l = infile.read().split()
    infile.close()

    for line in l:
        #if line is in l twice
        res = True
    return res

If my file contains "there is is a same word", I should get True.

If my file contains "there is not a same word", I should get False.

How do I determine whether there is a duplicate of a word in the string?

P.S. The duplicate word does not have to come right after the other, i.e. "there is a same word in the sentence over there" should return True because "there" is also a duplicate.

Upvotes: 2

Views: 1907

Answers (5)

Raymond Hettinger

Reputation: 226664

The str.split() method doesn't work well for splitting words in natural English text because of apostrophes and punctuation. You usually need the power of regular expressions for this:

>>> text = """I ain't gonna say ain't, because it isn't
in the dictionary. But my dictionary has it anyways."""

>>> text.lower().split()
['i', "ain't", 'gonna', 'say', "ain't,", 'because', 'it', "isn't", 'in', 'the',
 'dictionary.', 'but', 'my', 'dictionary', 'has', 'it', 'anyways.']

>>> import re
>>> words = re.findall(r"[a-z']+", text.lower())
>>> words
['i', "ain't", 'gonna', 'say', "ain't", 'because', 'it', "isn't", 'in', 'the',
 'dictionary', 'but', 'my', 'dictionary', 'has', 'it', 'anyways']

To find whether there are any duplicate words, you can use set operations:

>>> len(words) != len(set(words))
True

To list out the duplicate words, use the multiset operations in collections.Counter:

>>> from collections import Counter
>>> sorted(Counter(words) - Counter(set(words)))
["ain't", 'dictionary', 'it']
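As a rough sketch, the pieces above could be wrapped into reusable functions. The names `WORD_RE`, `tokenize`, `has_duplicate_words` and `duplicate_words` are made up for illustration; only `double` comes from the question. The pattern is the same simple one used above, so it only covers lowercase ASCII letters and apostrophes.

```python
import re
from collections import Counter

# Same pattern as in the session above; the constant name is hypothetical.
WORD_RE = re.compile(r"[a-z']+")


def tokenize(text):
    """Lowercase the text and extract runs of letters and apostrophes."""
    return WORD_RE.findall(text.lower())


def has_duplicate_words(text):
    """Return True if any word occurs more than once in text."""
    words = tokenize(text)
    # A set drops repeats, so the lengths differ only when a word repeats.
    return len(words) != len(set(words))


def duplicate_words(text):
    """Return the sorted list of words that occur more than once."""
    counts = Counter(tokenize(text))
    return sorted(word for word, n in counts.items() if n > 1)


def double(filename):
    """File-based wrapper using the function name from the question."""
    with open(filename) as infile:
        return has_duplicate_words(infile.read())
```

Because the text is lowercased first, "There" and "there" count as the same word here, which may or may not be what you want.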

Upvotes: 4

Martijn Pieters

Reputation: 1124658

Use a set to detect duplicates:

def double(filename):
    seen = set()
    with open(filename, 'r') as infile:
        # iterate over the open file line by line
        for line in infile:
            for word in line.split():
                if word in seen:
                    return True
                seen.add(word)
    return False

You could shorten that to:

def double(filename):
    seen = set()
    with open(filename, 'r') as infile:
        # seen.add() returns None, so the expression is true only for a repeated word
        return any(word in seen or seen.add(word) for line in infile for word in line.split())

Both versions exit early: as soon as a duplicate word is found, the function returns True. Only when there are no duplicates does it have to read the whole file before returning False.
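The early exit can be checked with a small sketch. It assumes a variant of the function that accepts any iterable of lines instead of a filename; `has_duplicate` and `counting` are hypothetical helper names, and `counting` simply records how many lines were actually consumed.

```python
def has_duplicate(lines):
    """Return True as soon as a repeated word is seen in an iterable of lines."""
    seen = set()
    for line in lines:
        for word in line.split():
            if word in seen:
                return True
            seen.add(word)
    return False


def counting(lines, consumed):
    """Yield lines one by one, recording each line handed out in consumed."""
    for line in lines:
        consumed.append(line)
        yield line


consumed = []
lines = ["there is is a same word\n", "second line\n", "third line\n"]
result = has_duplicate(counting(lines, consumed))
print(result, len(consumed))  # prints: True 1
```

Only the first line is read, because the repeated "is" is found there. With input that has no repeated word, every line is consumed before False comes back.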

Upvotes: 0

iruvar

Reputation: 23394

Another general approach to detecting duplicate words uses collections.Counter:

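from itertools import chain
from collections import Counter

with open('test_file.txt') as f:
    # count every whitespace-separated word across all lines
    x = Counter(chain.from_iterable(line.split() for line in f))

# Python 3 syntax: items() and print() replace iteritems() and the print statement
for key, value in x.items():
    if value > 1:
        print(key)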
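If you want the result as a value rather than printed output, the same idea can be sketched as a function. `repeated_words` is a hypothetical name, and the function is assumed to take any iterable of lines (an open file works).

```python
from collections import Counter
from itertools import chain


def repeated_words(lines):
    """Map each word that occurs more than once to its count."""
    counts = Counter(chain.from_iterable(line.split() for line in lines))
    return {word: n for word, n in counts.items() if n > 1}
```

An empty dict means no duplicates, so `bool(repeated_words(f))` gives the True/False answer the question asks for. Unlike the set-based approaches, this always reads the whole input.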

Upvotes: 0

def has_duplicate(l):
    # l is the list of words, e.g. infile.read().split()
    a = set()
    for word in l:
        if word in a:
            return True
        a.add(word)
    return False

Upvotes: 0

Mike Müller

Reputation: 85582

def has_duplicates(filename):
    seen = set()
    for line in open(filename):
        for word in line.split():
            if word in seen:
                return True
            seen.add(word)
    return False

Upvotes: 3
