Reputation: 79
I'm working through a Python exercise, and I tried stepping through it in the terminal to understand what's happening, but I still don't get it.
I mainly want to understand why the conditional returns only index 0.
Isn't checking `'casino' in 'Casinoville'.lower()` the same thing?
def word_search(documents, keyword):
    indices = []
    for i, doc in enumerate(documents):
        tokens = doc.split()
        normalized = [token.rstrip('.,').lower() for token in tokens]
        if keyword.lower() in normalized:
            indices.append(i)
    return indices
def word_search(documents, keyword):
    return [i for i, word in enumerate(doc_list) if keyword.lower() in word.rstrip('.,').lower()]
>>> doc_list = ["The Learn Python Challenge Casino.", "They bought a car", "Casinoville"]
>>> word_search(doc_list, 'casino')
[0]
>>> word_search(doc_list, 'casino')
[0, 2]
Upvotes: 1
Views: 334
Reputation: 104722
When you use the `in` operator, the result depends on the type of object on the right hand side. When it's a list (or most other kinds of containers), you get an exact membership test. So `'casino' in ['casino']` is `True`, but `'casino' in ['casinoville']` is `False` because the strings are not equal.
When the right hand side of `in` is a string though, it does something different. Rather than looking for an exact match against a single character (which is what strings contain if you think of them as sequences), it does a substring match. So `'casino' in 'casinoville'` is `True`, as would be `'casino' in 'montecasino'` or `'casino' in 'foocasinobar'` (it's not just prefixes that are checked).
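To make the two behaviours concrete, here is a small sketch you can paste into the interpreter. The variable names are only illustrative and are not part of the exercise.

```python
# Membership test on a list: each element is compared for equality.
words = ['casinoville']
list_match = 'casino' in words              # no element equals 'casino'

# Membership test on a string: looks for the substring anywhere.
text = 'casinoville'
prefix_match = 'casino' in text             # substring at the start
suffix_match = 'casino' in 'montecasino'    # substring at the end
middle_match = 'casino' in 'foocasinobar'   # substring in the middle

print(list_match, prefix_match, suffix_match, middle_match)
# False True True True
```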
For your problem, you want exact matches to whole words only. The reference solution uses `str.split` to separate words (with no argument, it splits on any kind of whitespace). It then cleans up the words a bit (stripping off trailing punctuation marks and lowercasing them), then does an `in` match against the list of strings.
Your code never splits the strings you are passed. So when you do an `in` test, you're doing a substring match on the whole document, and you'll get false positives when you match part of a larger word.
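As a rough sketch of the difference, the snippet below puts the two approaches side by side on the `doc_list` from the question. The helper `normalize` and the name `substring_search` are made up here for illustration; they are not from the exercise.

```python
doc_list = ["The Learn Python Challenge Casino.", "They bought a car", "Casinoville"]

def normalize(doc):
    """Hypothetical helper: split on whitespace, strip trailing '.' and ',', lowercase."""
    return [token.rstrip('.,').lower() for token in doc.split()]

def word_search(documents, keyword):
    """Whole-word match: `in` is applied to a list of words."""
    keyword = keyword.lower()
    return [i for i, doc in enumerate(documents) if keyword in normalize(doc)]

def substring_search(documents, keyword):
    """Substring match: `in` is applied to the whole document string."""
    keyword = keyword.lower()
    return [i for i, doc in enumerate(documents) if keyword in doc.lower()]

print(normalize(doc_list[0]))                # ['the', 'learn', 'python', 'challenge', 'casino']
print(word_search(doc_list, 'casino'))       # [0]
print(substring_search(doc_list, 'casino'))  # [0, 2]
```

The only change between the two functions is what sits on the right hand side of `in`: a list of cleaned-up words in the first, the raw lowercased string in the second.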
Upvotes: 0
Reputation: 5502
Let's try to understand the difference.
The expected solution can be written as a list comprehension:
def word_search(documents, keyword):
    return [i for i, word in enumerate(documents)
            if keyword.lower() in
            [token.rstrip('.,').lower() for token in word.split()]]
The problem occurs with the string "Casinoville" at index 2.
Look at the output:
print([token.rstrip('.,').lower() for token in doc_list[2].split()])
# ['casinoville']
Here is the point: the expected solution checks whether a word is in the list. The answer is `True` only if the whole string matches an element of the list (this is the expected output).
In your solution, however, you only check whether the string contains the keyword as a substring. In that case, the `in` condition is applied to the string itself and not to the list.
Compare:
# On the list :
print('casino' in [token.rstrip('.,').lower() for token in doc_list[2].split()])
# False
# On the string:
print('casino' in [token.rstrip('.,').lower() for token in doc_list[2].split()][0])
# True
As a result, "Casinoville" isn't included in the first case, while it is in the second.
Hope that helps!
Upvotes: 2
Reputation: 325
The exercise asks for a function that "Returns list of the index values into the original list for all documents containing the keyword".
You need to consider whole words only.
In the "Casinoville" case, the word "casino" is not present, because that document contains only the word "Casinoville".
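If you'd rather not split the documents yourself, one alternative is a regular expression with word boundaries. This is a different technique from the reference solution, and the function name `word_search_regex` is made up for this sketch. It assumes the keyword starts and ends with a letter, digit, or underscore, since `\b` only matches at the edge of such characters.

```python
import re

def word_search_regex(documents, keyword):
    """Return indices of documents where keyword appears as a whole word."""
    # re.escape guards against regex metacharacters in the keyword;
    # \b matches only at a boundary between word and non-word characters.
    pattern = re.compile(r'\b' + re.escape(keyword) + r'\b', re.IGNORECASE)
    return [i for i, doc in enumerate(documents) if pattern.search(doc)]

doc_list = ["The Learn Python Challenge Casino.", "They bought a car", "Casinoville"]
print(word_search_regex(doc_list, 'casino'))  # [0]
```

Note that this is not identical to the split-and-strip approach: for example, it would also match "casino" inside "casino-hotel", because the hyphen counts as a word boundary.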
Upvotes: 0