Matthew Leonard

Reputation: 33

How to read text from a file, identify adjacent duplicated words, and report their location in the text file?

I am trying to read a quote from a text file and find any duplicated words that appear next to each other. The following is the quote:

"He that would make his own liberty liberty secure,

must guard even his enemy from oppression;

for for if he violates this duty, he

he establishes a precedent that will reach to himself."
-- Thomas Paine

The output should be the following:

Found word: "liberty" on line 1

Found word: "for" on line 3

Found word: "he" on line 4

I have written the code to read the text from the file, but I am having trouble with the code to identify the duplicates. I have tried enumerating each word in the file and checking whether the word at one index is equal to the word at the following index. However, I am getting an IndexError because the loop runs past the end of the list. Here's what I've come up with so far:

import string
file_str = input("Enter file name: ")
input_file = open(file_str, 'r')

word_list = []
duplicates = []

for line in input_file:
    line_list = line.split()
    for word in line_list:
        if word != "--":
            word_list.append(word)

for idx, word in enumerate(word_list):
    print(idx, word)
    if word_list[idx] == word_list[idx + 1]:
        duplicates.append(word)

Any help with the current method I'm trying would be appreciated, or suggestions for another method.

Upvotes: 1

Views: 196

Answers (3)

timgeb

Reputation: 78750

Here's another approach.

from itertools import tee
from collections import defaultdict

dups = defaultdict(set)
with open('file.txt') as f:
    for no, line in enumerate(f, 1):
        # tee gives two independent iterators over the words of this line
        it1, it2 = tee(line.split())
        next(it2, None)  # advance the second iterator by one word
        # zip pairs each word with its follower (itertools.izip in Python 2)
        for word, follower in zip(it1, it2):
            if word != '--' and word == follower:
                dups[no].add(word)

which yields

>>> dups
defaultdict(<class 'set'>, {1: {'liberty'}, 3: {'for'}})

This is a dictionary that holds a set of adjacent duplicates for each line, e.g.

>>> dups[3]
{'for'}

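(I don't know why you expect "he" to be found on line four; it is not doubled within any single line of your sample file. The two occurrences sit at the end of line three and the start of line four, so a per-line check does not see them.)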
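If pairs that straddle a line break should count too, which is what the expected output in the question seems to assume, one option is to carry the previous word over from one line to the next. The following is only a sketch under that assumption: the name find_repeats is made up for illustration, the quote is embedded via io.StringIO instead of being read from a real file, and the pair is reported on the line of its second word.

```python
import io

QUOTE = (
    '"He that would make his own liberty liberty secure,\n'
    'must guard even his enemy from oppression;\n'
    'for for if he violates this duty, he\n'
    'he establishes a precedent that will reach to himself."\n'
    '-- Thomas Paine\n'
)


def find_repeats(lines):
    """Yield (word, line_number) for each word equal to the word before it."""
    previous = None  # deliberately not reset per line, so pairs can span lines
    for number, line in enumerate(lines, 1):
        for word in line.split():
            if word != "--" and word == previous:
                yield word, number
            previous = word


for word, number in find_repeats(io.StringIO(QUOTE)):
    print(f'Found word: "{word}" on line {number}')
```

With the sample quote this reports "liberty" on line 1, "for" on line 3 and "he" on line 4. An open file object can be passed in place of the StringIO.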

Upvotes: 0

user7492117

Reputation:

This should do the trick, OP. The for loop over the word list now only goes up to the second-to-last element, so idx + 1 never runs past the end. This won't keep track of the line numbers, though; I would use Phillip Martin's solution for that.

import string

file_str = input("Enter file name: ")
input_file = open(file_str, 'r')

word_list = []
duplicates = []

for line in input_file:
    line_list = line.split()
    for word in line_list:
        if word != "--":
            word_list.append(word)
# Here is the change I made: word_list[:-1] drops the last word,
# so word_list[idx + 1] always exists.
for idx, word in enumerate(word_list[:-1]):
    print(idx, word)
    if word_list[idx] == word_list[idx + 1]:
        duplicates.append(word)
print(duplicates)
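As an aside, the index arithmetic can be avoided altogether by pairing the list with a copy of itself shifted by one. This is a sketch of that alternative rather than a change to the code above; adjacent_duplicates is a hypothetical helper name.

```python
def adjacent_duplicates(words):
    """Return each word that is immediately followed by an identical word."""
    # zip stops at the shorter input, so the last word is never compared
    # against a non-existent successor and no IndexError can occur.
    return [word for word, follower in zip(words, words[1:]) if word == follower]


sample = "for for if he violates this duty, he".split()
print(adjacent_duplicates(sample))
```

Like the loop above, this works on a flat word list and so does not record line numbers.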

Upvotes: 0

Phillip Martin

Reputation: 1960

When you record the word_list you are losing information about which line the word is on.

A better approach might be to detect duplicates as you read each line.

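# Continues from the question's code: input_file, word_list and
# duplicates are already defined there.
line_number = 1
for line in input_file:
    line_list = line.split()
    previous_word = None  # reset so pairs are only matched within a line
    for word in line_list:
        if word != "--":
            word_list.append(word)
        if word == previous_word:
            duplicates.append([word, line_number])
        previous_word = word
    line_number += 1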
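For completeness, here is a self-contained sketch of the same per-line idea wrapped in a function. A few things in it are assumptions and not part of the loop above: the function name duplicates_by_line is hypothetical, enumerate replaces the manual counter, and words are compared case-insensitively with surrounding punctuation stripped (one plausible use for the string import in the question), so that "Liberty liberty" or "duty, duty" would also count.

```python
import string


def duplicates_by_line(path):
    """Return [word, line_number] pairs for words repeated within one line."""
    duplicates = []
    with open(path) as input_file:  # the file is closed automatically
        for line_number, line in enumerate(input_file, 1):
            previous_word = None  # reset per line, as in the loop above
            for raw_word in line.split():
                # Assumption: compare case-insensitively, ignoring punctuation.
                word = raw_word.strip(string.punctuation).lower()
                # Tokens made only of punctuation (such as "--") become empty
                # strings and are never reported.
                if word and word == previous_word:
                    duplicates.append([word, line_number])
                previous_word = word
    return duplicates
```

Calling duplicates_by_line with the path of the sample file should give the "liberty" pair on line 1 and the "for" pair on line 3; the "he" pair is still missed because previous_word is reset at every line.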

Upvotes: 0
