Reputation: 35

How to find doubled words in a file?

I am having a problem with my code. I am trying to find doubled words in a file, such as "the the", and print the line each one occurs on. So far my code gets the line count right, but it reports every word that is repeated anywhere in the file, not just words that appear immediately one after another.

What do I need to change so it only counts the doubled words?

my_file = input("Enter file name: ")
lst = []
count = 1
with open(my_file, "r") as dup:
    for line in dup:
        linedata = line.split()
        for word in linedata:
            if word not in lst:
                lst.append(word)
            else:
                print("Found word: {""} on line {}".format(word, count))
                count = count + 1
dup.close()

Upvotes: 3

Answers (3)

user7711283

Reputation:

Put the code below in a file named THISfile.py and execute it to see what it does:

# myFile = input("Enter file name: ")
# line No 2: line with with double 'with'
# line No 3: double ( word , word ) is not a double word
myFile="THISfile.py"
lstUniqueWords = []
noOfFoundWordDoubles = 0
totalNoOfWords       = 0
lineNo               = 0
lstLineNumbersWithWordDoubles = []
with open(myFile, "r") as myFile:
    for line in myFile:
        lineNo+=1 # memorize current line number 
        lineWords = line.split()
        if len(lineWords) > 0: # scan line only if it contains words
            currWord = lineWords[0] # remember already 'visited' word
            totalNoOfWords += 1
            if currWord not in lstUniqueWords: 
                lstUniqueWords.append(currWord) 
                # put 'visited' word word into lstAllWordsINmyFile (if it is not already there)
            lastWord = currWord # we are done with current, so current becomes last one
            if len(lineWords) > 1 : # proceed only if line has two or more words
                for word in lineWords[1:] : # loop over all other words
                    totalNoOfWords += 1
                    currWord = word
                    if currWord not in lstUniqueWords: 
                        lstUniqueWords.append(currWord) 
                        # put 'visited' word into lstAllWordsINmyFile (if it is not already there)
                    if( currWord == lastWord ): # duplicate word found: 
                        noOfFoundWordDoubles += 1
                        print("Found double word: ['{""}'] in line {}".format(currWord, lineNo))
                        lstLineNumbersWithWordDoubles.append(lineNo)
                    lastWord = currWord 
                    #        ^--- now after all all work is done, the currWord is considered lastWord
print(
    "noOfDoubles", noOfFoundWordDoubles, "\n",
    "totalNoOfWords", totalNoOfWords, "uniqueWords", len(lstUniqueWords), "\n",
    "linesWithDoubles", lstLineNumbersWithWordDoubles
)

The output should be:

Found double word: ['with'] in line 2
Found double word: ['word'] in line 19
Found double word: ['all'] in line 33
noOfDoubles 3 
 totalNoOfWords 221 uniqueWords 111 
 linesWithDoubles [2, 19, 33]

Now you can read through the comments in the code to get a better understanding of how it works.

Upvotes: 0

user7711283

Reputation:

Here is just the direct answer to the question asked:

"What do I need to change so it only counts the doubled words?"

Compare each word with the word immediately before it, instead of with every word seen so far:

my_file = input("Enter file name: ")
count = 0  # current line number
with open(my_file, "r") as dup:
    for line in dup:
        count = count + 1
        linedata = line.split()
        lastWord = ''  # reset for every line, so doubles split across two lines are not reported
        for word in linedata:
            if word == lastWord:
                print("Found word: {} on line {}".format(word, count))
            lastWord = word
# no dup.close() needed: the with block closes the file
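
The same previous-word comparison can also be wrapped in a function that accepts any iterable of lines, which makes it easy to try without a file on disk. This is only a minimal sketch: the name find_doubled_words and the sample lines are made up for illustration and are not part of the question.

```python
def find_doubled_words(lines):
    """Return (line_number, word) pairs for words immediately repeated on a line."""
    found = []
    for line_no, line in enumerate(lines, start=1):
        last_word = None  # reset per line: doubles across a line break are not reported
        for word in line.split():
            if word == last_word:
                found.append((line_no, word))
            last_word = word
    return found


# Hypothetical sample input; an open file object can be passed the same way.
sample = ["Paris in the the spring", "no repeats here", "it is is what it is"]
for line_no, word in find_doubled_words(sample):
    print("Found word: {} on line {}".format(word, line_no))
```

Because a file object iterates line by line, find_doubled_words(dup) inside the with block should behave the same as the loop above.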

Upvotes: 0

Maciek

Reputation: 3234

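my_file = input("Enter file name: ")
with open(my_file, "r") as dup:
    # start=1 so line_num matches the 1-based line numbers shown by an editor
    for line_num, line in enumerate(dup, start=1):
        words_in_line = line.split()
        # words_in_line[1:] is shifted by one, so words_in_line[i] is the word just before `word`
        duplicates = [word for i, word in enumerate(words_in_line[1:]) if words_in_line[i] == word]
        # duplicates now holds the doubled words found on this line
        # do whatever you want with it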
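
The index-based comprehension can also be written with zip, which pairs each word with the one that follows it. The sketch below is an illustration only: the helper name doubled_in_line and the ignore_case option are assumptions added here, not part of the answer above.

```python
def doubled_in_line(line, ignore_case=False):
    """Return the words in `line` that directly repeat the previous word."""
    words = line.split()
    if ignore_case:
        # Optional (assumed) behaviour: treat "The the" as a doubled word
        words = [w.lower() for w in words]
    # zip(words, words[1:]) yields (previous, current) pairs
    return [cur for prev, cur in zip(words, words[1:]) if prev == cur]


# Hypothetical sample lines standing in for the contents of a file
for line_num, line in enumerate(["The the cat", "sat on on the mat"], start=1):
    for word in doubled_in_line(line, ignore_case=True):
        print("Found word: {} on line {}".format(word, line_num))
```

Note that split() keeps punctuation attached, so "word, word" is not treated as a double in either version.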

Upvotes: 1
