MATT_BOOTY

Reputation: 83

The nature of lists in Python: why do I get a repeating list?

I was writing a program for an assignment on Coursera. I solved it, but I ran into some unintended behavior along the way. This is the code, run with romeo.txt as input:

fname = input("Enter file name: ")
fh = open(fname, 'r')
lst = list()
words = ''
fin = list()
for line in fh:
    words += line.strip(' ')

words = words.replace('\n', ' ')

for line in words:
    lst += words.split(' ')
print(lst)

Instead of giving me a list in which each word appears only once, it gives me every word, repeated an unknown number of times.

It gives me a huge list
of repeating words: ['But', 'soft', 'what', 'light', 'through', 'yonder', 'window', 'breaks', 'It', 'is', 'the', 'east', 'and', 'Juliet', 'is', 'the', 'sun', 'Arise', 'fair', 'sun', 'and', 'kill', 'the', 'envious', 'moon', 'Who', 'is', 'already', 'sick', 'and', 'pale', 'with', 'grief', 'But', 'soft', 'what', 'light', 'through', 'yonder', 'window', 'breaks', 'It', 'is', 'the', 'east', 'and', 'Juliet', 'is', 'the', 'sun', 'Arise', 'fair', 'sun', 'and', 'kill', 'the', 'envious', 'moon', 'Who', 'is', 'already', 'sick', 'and', 'pale', 'with', 'grief', 'But', 'soft', 'what', 'light', 'through', 'yonder', 'window', 'breaks', 'It', 'is', 'the', 'east', 'and', 'Juliet', 'is', 'the', 'sun', 'Arise', 'fair', 'sun', 'and', 'kill', 'the', 'envious', 'moon', 'Who', 'is', 'already', 'sick', 'and', 'pale', 'with', 'grief', 'But', 'soft', 'what', 'light', 'through', 'yonder', 'window', 'breaks', 'It', 'is', 'the', 'east', 'and', 'Juliet', 'is', 'the', 'sun', 'Arise', 'fair', 'sun' . . . ., 

The words repeat SO much more than that.

Upvotes: 1

Views: 101

Answers (3)

Joe Germuska

Reputation: 1688

Python lists are not expected to be unique; they keep every item you add, in the order in which it was inserted. If you want the unique set of words, use a Python set. You can create a set by passing a list to it, for example by changing your last line to

print(set(lst))

or you can create an empty set and then add words to it as you come across them, something like this:

s = set()
...
for... :
  s.update(words.split(' '))
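
As a minimal sketch of that second approach, assuming the text is already available as a string (the sample text below is a stand-in, not the real contents of romeo.txt, and `unique_words` is just an illustrative name):

```python
# Stand-in for the file contents; the real program would read romeo.txt.
text = "But soft what light\nIt is the east and Juliet is the sun\n"

unique_words = set()
for line in text.splitlines():
    # split() with no argument splits on any run of whitespace
    # and never produces empty strings.
    unique_words.update(line.split())

# Sets are unordered, so sort before printing for a stable result.
print(sorted(unique_words))
```

If the order of first appearance matters, `list(dict.fromkeys(word_list))` removes duplicates from a list while keeping that order (dicts preserve insertion order in Python 3.7+).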

Upvotes: 1

Paul M.

Reputation: 10799

Initially you said:

words = ''

Ok. So words is a string. Then, you said:

for line in fh:
    words += line.strip(' ')

For every line in the file, strip leading and trailing spaces from the current line and append the result to words. Each iteration appends to your words string, so when the loop is done, words will be one giant string.
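
One detail worth checking here: strip(' ') only removes space characters from the two ends of the string, so the newline at the end of each line stays in words. A small sketch with a made-up line (not taken from your file):

```python
line = "  But soft what light  \n"   # made-up line ending in a newline

# Only the leading spaces are removed: the right-hand end of the string
# is '\n', which is not a space, so stripping stops there immediately.
stripped = line.strip(' ')
print(repr(stripped))
```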

Then, you said:

words = words.replace('\n', ' ')

Ok. words is still a string. All you've done is replace all newline characters with spaces.

Then, you said:

for line in words:
    lst += words.split(' ')

line, in this case, is not a good name for this temporary variable, since you are no longer iterating over lines. Your iterable is words, which is a string. When you iterate over a string, you get individual characters, not lines:

>>> for line in "abcdefg":
    print(line)


a
b
c
d
e
f
g
>>> 

Just because I'm calling the temporary variable line, doesn't mean that that's what it actually is. I could have called it anything and I still would have received the same output. A better name for this variable, therefore, would be char, for example.

Back to your snippet: since you are iterating over the characters in your words string, you extend your list with the result of words.split(' ') once for every character! I don't need to see your input file to know that that's a gigantic list. The number of strings in lst will be the number of items returned by words.split(' ') multiplied by the number of characters in words, which is roughly the number of words in the file times the number of characters in the file.
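
To make the arithmetic concrete, here is a small sketch that uses a short stand-in string instead of your file. It reproduces the blow-up and then shows one possible fix, which is to drop the loop and split a single time (the name `fixed` is only for illustration):

```python
words = "But soft what light"  # stand-in for the joined file contents

# Original pattern: the loop body runs once per character in words.
lst = []
for char in words:
    lst += words.split(' ')

# 4 words and 19 characters, so 4 * 19 = 76 entries.
print(len(lst))

# Possible fix: no loop, just split once.
fixed = words.split()
print(fixed)
```

With the loop removed, each word appears once per occurrence in the text. Words that genuinely occur more than once in the file will still repeat, so removing those is a separate step.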

Upvotes: 2

DBA108642

Reputation: 2112

I'm not sure what the actual question is, but if you want something like a list that doesn't allow duplicates, the datatype you want is a set. Sets don't allow duplicates, so if you try to add a string that's already in the set, it will just be skipped. Try initializing sets instead of lists. Just a heads up as well: you can initialize blank lists like this:

lst = []
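
One caveat if you do switch to sets: there is no equivalent literal for an empty set, because {} creates an empty dict. A quick sketch (variable names are only for illustration):

```python
empty_list = []
empty_set = set()   # the way to write an empty set
empty_dict = {}     # {} is a dict, not a set

seen = set()
seen.add("sun")
seen.add("sun")     # already present, so the set is unchanged

print(type(empty_dict).__name__, len(seen))
```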

Upvotes: 0
