Reputation: 83
I was writing a program for an assignment on Coursera. I solved it, but along the way I got some unintended behavior. I ran the following code with romeo.txt as input:
fname = input("Enter file name: ")
fh = open(fname, 'r')
lst = list()
words = ''
fin = list()
for line in fh:
    words += line.strip(' ')
words = words.replace('\n', ' ')
for line in words:
    lst += words.split(' ')
print(lst)
Instead of giving me a list in which each word appears only once, it gives me every word, repeated an unknown number of times. I get a huge list
of repeating words: ['But', 'soft', 'what', 'light', 'through', 'yonder', 'window', 'breaks', 'It', 'is', 'the', 'east', 'and', 'Juliet', 'is', 'the', 'sun', 'Arise', 'fair', 'sun', 'and', 'kill', 'the', 'envious', 'moon', 'Who', 'is', 'already', 'sick', 'and', 'pale', 'with', 'grief', 'But', 'soft', 'what', 'light', 'through', 'yonder', 'window', 'breaks', 'It', 'is', 'the', 'east', 'and', 'Juliet', 'is', 'the', 'sun', 'Arise', 'fair', 'sun', 'and', 'kill', 'the', 'envious', 'moon', 'Who', 'is', 'already', 'sick', 'and', 'pale', 'with', 'grief', 'But', 'soft', 'what', 'light', 'through', 'yonder', 'window', 'breaks', 'It', 'is', 'the', 'east', 'and', 'Juliet', 'is', 'the', 'sun', 'Arise', 'fair', 'sun', 'and', 'kill', 'the', 'envious', 'moon', 'Who', 'is', 'already', 'sick', 'and', 'pale', 'with', 'grief', 'But', 'soft', 'what', 'light', 'through', 'yonder', 'window', 'breaks', 'It', 'is', 'the', 'east', 'and', 'Juliet', 'is', 'the', 'sun', 'Arise', 'fair', 'sun' . . . .,
The words repeat SO much more than that.
Upvotes: 1
Views: 101
Reputation: 1688
Python lists do not enforce uniqueness; they simply preserve the order in which items were inserted. If you want the unique set of words, use a Python set. You can create a set by passing a list to it, for example by changing your last line to
print(set(lst))
or you can create an empty set and then add words to it as you come across them, something like this:
s = set()
...
for ... :
    s.update(words.split(' '))
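As a quick illustration of both approaches, the sketch below uses a short hard-coded list in place of the words read from the file (the sample words and the names `unique`, `chunk`, and `ordered_unique` are assumptions for demonstration only). Note that a set does not keep the original order of the words; if order matters, `dict.fromkeys` is one alternative that drops duplicates while keeping first-seen order.

```python
lst = ['the', 'sun', 'and', 'the', 'moon', 'and', 'the', 'sun']

# Approach 1: build the set from the finished list.
unique = set(lst)

# Approach 2: start with an empty set and update it piece by piece.
s = set()
for chunk in ('the sun and', 'the moon and the sun'):
    s.update(chunk.split(' '))

# Alternative (not a set): dict keys are unique and keep insertion order.
ordered_unique = list(dict.fromkeys(lst))

print(sorted(unique))    # sorted only to get a stable display order
print(sorted(s))
print(ordered_unique)
```

Both approaches end up with the same four words; only the `dict.fromkeys` version guarantees the order in which they first appeared.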
Upvotes: 1
Reputation: 10799
Initially you said:
words = ''
OK. So words is a string.
Then, you said:
for line in fh:
    words += line.strip(' ')
For every line in the file, strip leading and trailing spaces from the current line and append the result to words. Each iteration appends to your words string. When the loop is done, words will be one giant string.
Then, you said:
words = words.replace('\n', ' ')
OK. words is still a string. All you've done is replace all newline characters with spaces.
Then, you said:
for line in words:
    lst += words.split(' ')
line, in this case, is not a good name for this temporary variable, since you are not iterating over lines anymore. Your iterable is words, which is a string. When you iterate over a string, you get individual characters, not lines:
>>> for line in "abcdefg":
...     print(line)
...
a
b
c
d
e
f
g
>>>
Just because I'm calling the temporary variable line doesn't mean that that's what it actually is. I could have called it anything and I still would have received the same output. A better name for this variable, therefore, would be char, for example.
Back to your snippet: since you are iterating over the characters in your words string, you are extending your list with the result of words.split(' ') once for every character! I don't need to see your input file to know that that's a gigantic list. The number of strings in your lst list will be approximately equal to the number of words in the file multiplied by the number of characters in the file.
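To make that count concrete, here is a minimal sketch that uses a short stand-in string rather than the real romeo.txt (the sample text and the names `text`, `buggy`, and `fixed` are made up for illustration). It reproduces the loop from the question, then shows that a single call to split, with no loop at all, yields each word once per occurrence:

```python
# Stand-in for the file contents after newlines were replaced by spaces.
text = "But soft what light"

# The loop from the question: one full split per character of the string.
buggy = []
for char in text:
    buggy += text.split(' ')

# One split, no loop.
fixed = text.split(' ')

print(len(text))    # number of characters in the string
print(len(fixed))   # number of words
print(len(buggy))   # characters multiplied by words
print(fixed)
```

Dropping the second for loop and calling split once is likely all the original code needs to stop repeating words; removing duplicates is then a separate step.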
Upvotes: 2
Reputation: 2112
Not sure what the actual question is, but if you want something like a list that doesn't allow duplicates, the datatype you want is a set. Sets don't allow duplicates, so if you try to add a string that's already in the set, it will just be skipped. Try initializing sets instead of lists. Just a heads up as well: you can initialize blank lists like this:
lst = []
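A small sketch of that behavior (the name `seen` is only illustrative): adding a value that is already in the set leaves the set unchanged. Unlike lists, an empty set has no bracket shorthand; it has to be created with `set()`, because `{}` creates an empty dict.

```python
seen = set()          # empty set; {} would be an empty dict
seen.add('sun')
seen.add('moon')
seen.add('sun')       # already present, so nothing changes

print(len(seen))
print(type({}).__name__)
```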
Upvotes: 0