Reputation: 35
I am having some problems with my code. I am trying to find repeated words in a file, such as "the the", and then print the line each one occurs on. So far my code gets the line count right, but it reports every word that is repeated anywhere in the file, not just the ones that come right after one another.
What do I need to change so it only counts the doubled words?
my_file = input("Enter file name: ")
lst = []
count = 1
with open(my_file, "r") as dup:
    for line in dup:
        linedata = line.split()
        for word in linedata:
            if word not in lst:
                lst.append(word)
            else:
                print("Found word: {} on line {}".format(word, count))
        count = count + 1
dup.close()
Upvotes: 3
Views: 88
Reputation:
Put the code below in a file named THISfile.py and run it to see what it does:
# myFile = input("Enter file name: ")
# line No 2: line with with double 'with'
# line No 3: double ( word , word ) is not a double word
myFile="THISfile.py"
lstUniqueWords = []
noOfFoundWordDoubles = 0
totalNoOfWords = 0
lineNo = 0
lstLineNumbersWithWordDoubles = []
with open(myFile, "r") as myFile:
    for line in myFile:
        lineNo+=1 # memorize current line number
        lineWords = line.split()
        if len(lineWords) > 0: # scan line only if it contains words
            currWord = lineWords[0] # remember already 'visited' word
            totalNoOfWords += 1
            if currWord not in lstUniqueWords:
                lstUniqueWords.append(currWord)
                # put 'visited' word word into lstAllWordsINmyFile (if it is not already there)
            lastWord = currWord # we are done with current, so current becomes last one
            if len(lineWords) > 1 : # proceed only if line has two or more words
                for word in lineWords[1:] : # loop over all other words
                    totalNoOfWords += 1
                    currWord = word
                    if currWord not in lstUniqueWords:
                        lstUniqueWords.append(currWord)
                        # put 'visited' word into lstAllWordsINmyFile (if it is not already there)
                    if( currWord == lastWord ): # duplicate word found:
                        noOfFoundWordDoubles += 1
                        print("Found double word: ['{""}'] in line {}".format(currWord, lineNo))
                        lstLineNumbersWithWordDoubles.append(lineNo)
                    lastWord = currWord
                    # ^--- now after all all work is done, the currWord is considered lastWord
print(
    "noOfDoubles", noOfFoundWordDoubles, "\n",
    "totalNoOfWords", totalNoOfWords, "uniqueWords", len(lstUniqueWords), "\n",
    "linesWithDoubles", lstLineNumbersWithWordDoubles
)
The output should be:
Found double word: ['with'] in line 2
Found double word: ['word'] in line 19
Found double word: ['all'] in line 33
noOfDoubles 3
totalNoOfWords 221 uniqueWords 111
linesWithDoubles [2, 19, 33]
Now you can read through the comments in the code to get a better understanding of how it works.
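For comparison, here is a shorter sketch of the same bookkeeping (doubles, total words, unique words, lines with doubles). It is not the code above, only one possible condensed form: `word_stats` is a made-up name, it takes a list of strings instead of reading a file so it is easy to try out, and it uses a `set` for the unique words instead of a list.

```python
def word_stats(lines):
    """Return (doubles, total_words, unique_word_count, lines_with_doubles).

    `lines` is any iterable of strings; this helper is hypothetical and
    only mirrors the counters used in the longer script.
    """
    unique_words = set()
    total_words = 0
    doubles = 0
    lines_with_doubles = []
    for line_no, line in enumerate(lines, start=1):
        words = line.split()
        total_words += len(words)
        unique_words.update(words)
        # Compare every word with the one directly before it on the same line.
        for previous, current in zip(words, words[1:]):
            if previous == current:
                doubles += 1
                lines_with_doubles.append(line_no)
    return doubles, total_words, len(unique_words), lines_with_doubles


sample = ["a a b", "c d", "e e e"]
print(word_stats(sample))
```

As in the longer script, a line number is recorded once per double found, so a line containing "e e e" shows up twice in the last list.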
Upvotes: 0
Reputation:
Here is just the direct answer to the question asked:
"What do I need to change so it only counts the doubled words?"
Here you are:
my_file = input("Enter file name: ")
count = 0  # current line number
with open(my_file, "r") as dup:
    for line in dup:
        count = count + 1
        linedata = line.split()
        lastWord = ''  # reset for every line
        for word in linedata:
            if word == lastWord:  # same as the word directly before it
                print("Found word: {} on line {}".format(word, count))
            lastWord = word
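If you want to try the idea without a file on disk, the same last-word comparison can be wrapped in a generator. This is only a sketch: `find_doubled_words` and the sample lines are invented for illustration, and it takes any iterable of strings (an open file works too).

```python
def find_doubled_words(lines):
    """Yield (line_number, word) for each word that repeats the word before it."""
    for line_number, line in enumerate(lines, start=1):
        last_word = None  # reset per line, so a double split across two lines is not reported
        for word in line.split():
            if word == last_word:
                yield line_number, word
            last_word = word


sample = ["the the cat sat", "on the mat", "it was was was fine"]
for line_number, word in find_doubled_words(sample):
    print("Found word: {} on line {}".format(word, line_number))
```

Note that "the" on the second sample line is not reported even though it already appeared on the first line, because only the directly preceding word is compared.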
Upvotes: 0
Reputation: 3234
my_file = input("Enter file name: ")
with open(my_file, "r") as dup:
    for line_num, line in enumerate(dup, start=1):  # number lines from 1
        words_in_line = line.split()
        duplicates = [word for i, word in enumerate(words_in_line[1:]) if words_in_line[i] == word]
        # duplicates now holds every word on this line that repeats the word before it
        # do whatever you want with it
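One way to fill in the "do whatever you want with it" part, as a sketch only: collect the duplicates per line in a dict. The pairing is written here with `zip(words, words[1:])` instead of the index-based comprehension, which should give the same result. `doubled_words_by_line` is a hypothetical helper name, and it takes a list of strings so it can be run without a file.

```python
def doubled_words_by_line(lines):
    """Map line number -> list of words that repeat the word directly before them."""
    result = {}
    for line_num, line in enumerate(lines, start=1):
        words = line.split()
        # zip pairs each word with its successor: (w0, w1), (w1, w2), ...
        duplicates = [cur for prev, cur in zip(words, words[1:]) if prev == cur]
        if duplicates:
            result[line_num] = duplicates
    return result


sample = ["this is is a test", "nothing here", "go go go"]
for line_num, words in doubled_words_by_line(sample).items():
    print("line {}: {}".format(line_num, ", ".join(words)))
```

Lines without any doubled word are left out of the dict entirely.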
Upvotes: 1