Reputation: 79

Doubts about string

I'm working on a Python exercise. I tried running it step by step in the terminal to understand what's happening, but I still don't get it.

Mainly, I want to understand why the condition should match only index 0. Isn't checking 'casino' in 'Casinoville'.lower() the same thing?

Exercise:

Takes a list of documents (each document is a string) and a keyword. Returns list of the index values into the original list for all documents containing the keyword.

Exercise solution

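def word_search(documents, keyword):
    indices = []
    for i, doc in enumerate(documents):
        # Split the document into words on whitespace
        tokens = doc.split()
        # Strip trailing periods and commas, then lowercase each word
        normalized = [token.rstrip('.,').lower() for token in tokens]
        # List membership: True only when a whole word equals the keyword
        if keyword.lower() in normalized:
            indices.append(i)
    return indices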

My solution

def word_search(documents, keyword):
    return [i for i, word in enumerate(documents) if keyword.lower() in word.rstrip('.,').lower()]

Run

>>> doc_list = ["The Learn Python Challenge Casino.", "They bought a car", "Casinoville"]

Expected output

>>> word_search(doc_list, 'casino')
[0]

Actual output

>>> word_search(doc_list, 'casino')
[0, 2]

Upvotes: 1

Answers (3)

Blckknght

Reputation: 104722

When you use the in operator, the result depends on the type of object on the right hand side. When it's a list (or most other kinds of containers), you get an exact membership test. So 'casino' in ['casino'] is True, but 'casino' in ['casinoville'] is False because the strings are not equal.

When the right-hand side of in is a string, though, it does something different. Rather than looking for an exact match against a single character (which is what strings contain if you think of them as sequences), it does a substring match. So 'casino' in 'casinoville' is True, as is 'casino' in 'montecasino' or 'casino' in 'foocasinobar' (it's not just prefixes that are checked).
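
To see the two behaviours side by side, here is a minimal sketch; the variable names are only illustrative and do not come from the exercise:

```python
words = ['casinoville']   # a list containing one string
text = 'casinoville'      # the string itself

# Against a list, `in` compares the keyword with each element for equality.
in_list = 'casino' in words
# Against a string, `in` looks for the keyword anywhere inside it.
in_string = 'casino' in text
in_middle = 'casino' in 'foocasinobar'

print(in_list, in_string, in_middle)  # False True True
```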

For your problem, you want exact matches to whole words only. The reference solution uses str.split to separate the words (with no argument, it splits on any kind of whitespace). It then cleans up the words a bit (stripping trailing periods and commas and lowercasing them), then does an in test against the list of words.
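
As a rough illustration of those steps, this is what the intermediate values look like for the first document in the question (the variable names mirror the reference solution):

```python
doc = "The Learn Python Challenge Casino."

# Step 1: split on whitespace; punctuation stays attached to the words.
tokens = doc.split()
print(tokens)      # ['The', 'Learn', 'Python', 'Challenge', 'Casino.']

# Step 2: strip trailing '.' and ',' and lowercase each word.
normalized = [token.rstrip('.,').lower() for token in tokens]
print(normalized)  # ['the', 'learn', 'python', 'challenge', 'casino']

# Step 3: membership test against the list of whole words.
found = 'casino' in normalized
print(found)       # True
```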

Your code never splits the strings you are passed. So when you do an in test, you're doing a substring match on the whole document, and you'll get false positives when you match part of a larger word.
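
A small sketch comparing the two approaches on the data from the question. The function names substring_search and whole_word_search are made up for this comparison; the first mirrors the question's one-liner and the second the reference solution:

```python
def substring_search(documents, keyword):
    # `in` is applied to the whole document string, so this is a substring match.
    return [i for i, doc in enumerate(documents)
            if keyword.lower() in doc.rstrip('.,').lower()]

def whole_word_search(documents, keyword):
    # The document is split first, so `in` tests membership in a list of words.
    return [i for i, doc in enumerate(documents)
            if keyword.lower() in [t.rstrip('.,').lower() for t in doc.split()]]

doc_list = ["The Learn Python Challenge Casino.", "They bought a car", "Casinoville"]
print(substring_search(doc_list, 'casino'))   # [0, 2]
print(whole_word_search(doc_list, 'casino'))  # [0]
```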

Upvotes: 0

Alexandre B.

Reputation: 5502

Let's try to understand the difference.

The reference solution can be written as a list comprehension:

def word_search(documents, keyword):
    return [i for i, word in enumerate(documents) 
                if keyword.lower() in 
                    [token.rstrip('.,').lower() for token in word.split()]]

The problem occurs with the string "Casinoville" at index 2.

See the output:

print([token.rstrip('.,').lower() for token in doc_list[2].split()])
# ['casinoville']

And here is the issue: the reference solution checks whether a word is in the list. The answer is True only if the whole string matches (this is the expected behavior).

However, in your solution, you only check whether the document contains the keyword as a substring. In this case, in is applied to the string itself and not to a list.

See it:

# On the list : 
print('casino' in [token.rstrip('.,').lower() for token in doc_list[2].split()])
# False

# On the string:
print('casino' in [token.rstrip('.,').lower() for token in doc_list[2].split()][0])
# True

As a result, "Casinoville" isn't included in the first case, while it is in the second.

Hope that helps!

Upvotes: 2

CYC

Reputation: 325

The exercise says: "Returns list of the index values into the original list for all documents containing the keyword".

You need to consider whole words only.

In the "Casinoville" case, the word "casino" is not present, since that document contains only the word "Casinoville".