Reputation: 57
I want to write a function that takes a filename as a string and returns True if the file contains duplicate words and False otherwise.
So far I have:
def double(filename):
    infile = open(filename, 'r')
    res = False
    l = infile.split()
    infile.close()
    for line in l:
        # if line is in l twice
        res = True
    return res
If my file contains "there is is a same word", I should get True.
If my file contains "there is not a same word", I should get False.
How do I determine whether a word occurs more than once in the string?
P.S. The duplicate word does not have to come right after the first occurrence; e.g., "there is a same word in the sentence over there" should return True because "there" is also a duplicate.
Upvotes: 2
Views: 1907
Reputation: 226664
The str.split() method doesn't work well for splitting words in natural English text because of apostrophes and punctuation. You usually need the power of regular expressions for this:
>>> import re
>>> text = """I ain't gonna say ain't, because it isn't
... in the dictionary. But my dictionary has it anyways."""
>>> text.lower().split()
['i', "ain't", 'gonna', 'say', "ain't,", 'because', 'it', "isn't", 'in', 'the',
 'dictionary.', 'but', 'my', 'dictionary', 'has', 'it', 'anyways.']
>>> words = re.findall(r"[a-z']+", text.lower())
>>> words
['i', "ain't", 'gonna', 'say', "ain't", 'because', 'it', "isn't", 'in', 'the',
 'dictionary', 'but', 'my', 'dictionary', 'has', 'it', 'anyways']
To find whether there are any duplicate words, you can use set operations:
>>> len(words) != len(set(words))
True
To list out the duplicate words, use the multiset operations in collections.Counter:
>>> from collections import Counter
>>> sorted(Counter(words) - Counter(set(words)))
["ain't", 'dictionary', 'it']
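Applied to the original question, the tokenizing and set comparison could be combined roughly as follows. This is only a sketch: the function name `has_duplicate_words` and the decision to lowercase the text (so that "There" and "there" count as the same word) are my assumptions, not requirements stated in the question.

```python
import re


def has_duplicate_words(filename):
    """Return True if any word occurs more than once in the file."""
    with open(filename) as infile:
        # Assumption: comparison is case-insensitive.
        text = infile.read().lower()
    # Runs of letters and apostrophes count as words, so "ain't," and
    # "ain't" are treated as the same word.
    words = re.findall(r"[a-z']+", text)
    # A set drops repeats, so a shorter set means at least one duplicate.
    return len(words) != len(set(words))
```

This reads the whole file into memory, which is fine for small files; the set-based loops in the other answers can stop at the first repeat instead.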
Upvotes: 4
Reputation: 1124658
Use a set to detect duplicates:
def double(filename):
    seen = set()
    with open(filename, 'r') as infile:
        # Iterate over the file object itself to read it line by line.
        for line in infile:
            for word in line.split():
                if word in seen:
                    return True
                seen.add(word)
    return False
You could shorten that to:
def double(filename):
    seen = set()
    with open(filename, 'r') as infile:
        return any(word in seen or seen.add(word)
                   for line in infile for word in line.split())
Both versions exit early: as soon as a duplicate word is found, the function returns True. It does have to read the whole file to determine that there are no duplicates and return False.
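To see why the short form works and that it really does stop early, here is a small sketch that runs the same expression over a list of words rather than a file. The helper names `has_repeat` and `tracked` are made up for this illustration.

```python
def has_repeat(words):
    seen = set()
    # set.add() returns None (falsy), so the expression is truthy only
    # when the word was already in seen; any() stops at the first
    # truthy value it encounters.
    return any(word in seen or seen.add(word) for word in words)


consumed = []


def tracked(words):
    """Yield words unchanged while recording which ones were consumed."""
    for w in words:
        consumed.append(w)
        yield w


print(has_repeat(tracked("there is is a same word".split())))  # True
print(consumed)  # ['there', 'is', 'is']
```

Only the first three words are pulled from the generator; the rest of the input is never examined.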
Upvotes: 0
Reputation: 23394
Another general approach to detecting duplicate words uses collections.Counter (shown here in Python 3 syntax):
from itertools import chain
from collections import Counter
with open('test_file.txt') as f:
    # Count every whitespace-separated word across all lines.
    x = Counter(chain.from_iterable(line.split() for line in f))
for key, value in x.items():
    if value > 1:
        print(key)
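If you want the duplicates back as a value instead of printed, the same idea can be wrapped in a function. This is a sketch only; the name `duplicate_words` and the sorted-list return value are my own choices for illustration.

```python
from collections import Counter
from itertools import chain


def duplicate_words(filename):
    """Return a sorted list of words that appear more than once."""
    with open(filename) as f:
        # Counter consumes the generator while the file is still open.
        counts = Counter(chain.from_iterable(line.split() for line in f))
    return sorted(word for word, n in counts.items() if n > 1)
```

An empty list means no duplicates, so `bool(duplicate_words(path))` answers the original True/False question. Unlike the set-based loop, this always reads the whole file.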
Upvotes: 0
Reputation: 792
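def double(filename):
    with open(filename) as infile:
        l = infile.read().split()  # list of words in the file
    a = set()
    for word in l:
        if word in a:
            return True
        a.add(word)
    return False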
Upvotes: 0
Reputation: 85582
def has_duplicates(filename):
    seen = set()
    # The with block closes the file even on an early return.
    with open(filename) as infile:
        for line in infile:
            for word in line.split():
                if word in seen:
                    return True
                seen.add(word)
    return False
Upvotes: 3