Reputation: 33
I am trying to read a quote from a text file and find any duplicated words that appear next to each other. The following is the quote:
"He that would make his own liberty liberty secure,
must guard even his enemy from oppression;
for for if he violates this duty, he
he establishes a precedent that will reach to himself."
-- Thomas Paine
The output should be the following:
Found word: "Liberty" on line 1
Found word: "for" on line 3
Found word: "he" on line 4
I have written the code to read the text from the file, but I am having trouble with the code to identify the duplicates. I have tried enumerating each word in the file and checking whether the word at one index is equal to the word at the following index. However, I am getting an index error because the loop continues outside of the index range. Here's what I've come up with so far:
import string

file_str = input("Enter file name: ")
input_file = open(file_str, 'r')
word_list = []
duplicates = []
for line in input_file:
    line_list = line.split()
    for word in line_list:
        if word != "--":
            word_list.append(word)
for idx, word in enumerate(word_list):
    print(idx, word)
    # On the last word, idx + 1 is past the end of the list: IndexError
    if word_list[idx] == word_list[idx + 1]:
        duplicates.append(word)
Any help with the current method I'm trying would be appreciated, or suggestions for another method.
Upvotes: 1
Views: 196
Reputation: 78750
Here's another approach.
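from itertools import tee, izip  # Python 2; on Python 3 use the built-in zip instead of izip
from collections import defaultdict

dups = defaultdict(set)
with open('file.txt') as f:
    for no, line in enumerate(f, 1):
        # Two iterators over the same words; advance the second by one
        # so that zipping them pairs each word with its follower.
        it1, it2 = tee(line.split())
        next(it2, None)
        for word, follower in izip(it1, it2):
            if word != '--' and word == follower:
                dups[no].add(word)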
which yields
>>> dups
defaultdict(<type 'set'>, {1: set(['liberty']), 3: set(['for'])})
This is a dictionary that maps each line number to the set of words doubled on that line, e.g.
>>> dups[3]
set(['for'])
(I don't know why you expect "he" to be found on line four; it is not doubled within any single line of your sample file.)
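One possible reading of the expected output is that the "he" ending line 3 and the "he" starting line 4 count as adjacent. Under that assumption, a minimal Python 3 sketch could carry the previous word across line breaks instead of pairing words within each line. The function name find_adjacent_duplicates and the in-memory sample are hypothetical stand-ins, not part of the question or the code above.

```python
import io

def find_adjacent_duplicates(lines):
    """Return (word, line_number) pairs for words that repeat the previous word.

    The previous word is carried across line breaks, so a word that ends one
    line and is repeated at the start of the next is reported on the later line.
    """
    found = []
    previous = None
    for line_number, line in enumerate(lines, 1):
        for word in line.split():
            if word == "--":
                # Skip the attribution marker, as the question's code does.
                continue
            if word == previous:
                found.append((word, line_number))
            previous = word
    return found

# Hypothetical in-memory stand-in for the quote file.
sample = io.StringIO(
    '"He that would make his own liberty liberty secure,\n'
    'must guard even his enemy from oppression;\n'
    'for for if he violates this duty, he\n'
    'he establishes a precedent that will reach to himself."\n'
    '-- Thomas Paine\n'
)
result = find_adjacent_duplicates(sample)
for word, line_number in result:
    print('Found word: "{}" on line {}'.format(word, line_number))
```

With the sample quote this reports "liberty" on line 1, "for" on line 3, and "he" on line 4. The comparison is case-sensitive and leaves punctuation attached, so `duty,` and `duty` would not match.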
Upvotes: 0
Reputation:
This should do the trick, OP. The for loop over the word list now only goes up to the second-to-last element, so idx + 1 never runs past the end. This won't keep track of the line numbers, though; I would use Phillip Martin's solution for that.
import string

file_str = input("Enter file name: ")
input_file = open(file_str, 'r')
word_list = []
duplicates = []
for line in input_file:
    line_list = line.split()
    for word in line_list:
        if word != "--":
            word_list.append(word)
# Here is the change I made: slice off the last element with [:-1]
for idx, word in enumerate(word_list[:-1]):
    print(idx, word)
    if word_list[idx] == word_list[idx + 1]:
        duplicates.append(word)
print(duplicates)
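As a variation on the same fix, here is a minimal sketch that avoids index arithmetic altogether by pairing the list with a copy of itself shifted by one. The helper name adjacent_duplicates and the sample word list are made up for illustration.

```python
def adjacent_duplicates(words):
    """Return each word that is immediately followed by an identical word."""
    # zip stops at the end of the shorter argument, so the last word is
    # never compared against a missing neighbour.
    return [first for first, second in zip(words, words[1:]) if first == second]

# Hypothetical sample taken from parts of the quote.
words = "his own liberty liberty secure, for for if he".split()
print(adjacent_duplicates(words))  # ['liberty', 'for']
```

Like the loop above, this works on the flattened word list, so it does not know which line a duplicate came from.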
Upvotes: 0
Reputation: 1960
When you record the word_list, you lose the information about which line each word is on. It would perhaps be better to detect duplicates as you read the lines.
# Reuses input_file, word_list and duplicates from the question's code
line_number = 1
for line in input_file:
    line_list = line.split()
    previous_word = None  # reset on every line, so only same-line repeats are found
    for word in line_list:
        if word != "--":
            word_list.append(word)
            if word == previous_word:
                duplicates.append([word, line_number])
        previous_word = word
    line_number += 1
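To get from the duplicates list to the output format shown in the question, something like the following could work. The helper name format_report is hypothetical, and the sample value of duplicates is an assumption about what the loop above collects for the quoted file.

```python
def format_report(duplicates):
    """Build one 'Found word' line per [word, line_number] entry."""
    return ['Found word: "{}" on line {}'.format(word, line_number)
            for word, line_number in duplicates]

# Assumed contents of `duplicates` after the loop above has processed the quote.
duplicates = [["liberty", 1], ["for", 3]]
for report_line in format_report(duplicates):
    print(report_line)
```

Because previous_word is reset at the start of each line, a word repeated across a line break (such as the "he" ending line 3 and starting line 4) is not reported by this approach.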
Upvotes: 0