Reputation: 1257
I have a list of words and phrases:
words = ['hi', 'going on', 'go']
And I have a transcript:
transcript ="hi how are you. I am good. what's going on.".split('.')
I need to find matches in this transcript. For the example above, matches are in the first and third elements of the transcript.
I followed answers from here and I tried to use the following code:
for i in range(len(transcript)):
    if any(word in transcript[i] for word in words):
        print(i + 1)
Its output is:
1
2
3
But that is not what I want. I want to exclude the 'I am good' sentence from the output. The expected output is:
1
3
Upvotes: 0
Views: 61
Reputation: 42133
The issue is that you are not limiting your search to whole expressions. This means that any word that can appear as a substring of another word (e.g. "go" is a substring of "good") will be treated as a match.
This would call for the use of regular expressions (the re module).
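As a rough sketch of the regular-expression route, assuming the phrases should match case-insensitively and only on word boundaries (the function name find_matches is made up for this illustration):

```python
import re

def find_matches(words, sentences):
    """Return 1-based indices of sentences containing any whole word/phrase."""
    # Build one alternation; re.escape protects any regex metacharacters.
    alternation = "|".join(re.escape(word) for word in words)
    # \b on both sides keeps 'go' from matching inside 'good'.
    pattern = re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)
    return [i for i, text in enumerate(sentences, 1) if pattern.search(text)]

words = ['hi', 'going on', 'go']
transcript = "hi how are you. I am good. what's going on.".split('.')
print(find_matches(words, transcript))  # [1, 3]
```

Note that \b treats digits and underscores as word characters, so this may behave differently from a letters-only approach on text such as "go2".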
Alternatively, you could transform every non-letter character into a space, and then perform the search with padded spaces around the words and the text so that you only find whole words (whole expressions in your case).
For example:
# translation table for all non-letters to spaces
from string import printable, ascii_letters
spaces = str.maketrans({nl: " " for nl in set(printable) - set(ascii_letters)})
words = ['hi', 'going on', 'go']
paddedWords = [f" {word} " for word in words]
transcript = "hi how are you. I am good. what's going on.".split('.')
for i, text in enumerate(transcript, 1):
    paddedText = f" {text.lower().translate(spaces)} "
    if any(word in paddedText for word in paddedWords):
        print(i)
# 1
# 3
Upvotes: 0
Reputation: 4472
You can try
for i in range(len(transcript)):
    if any(word in transcript[i].split(" ") if len(word.split(" ")) < 2 else word in transcript[i] for word in words):
        print(i + 1)
That will output
1
3
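With this code, a single word is compared against the whole space-separated tokens of transcript[i], so 'go' no longer matches inside 'good'. Multi-word phrases such as 'going on' are still matched as plain substrings.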
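If multi-word phrases should also match only on whole words, one possible variation is to compare token sequences instead of substrings. This is only a sketch: the helper names contains_phrase and find_matches are hypothetical, tokens are split on whitespace, comparison is lowercased, and punctuation attached to a word (e.g. "on,") is not stripped.

```python
def contains_phrase(tokens, phrase_tokens):
    """True if phrase_tokens occurs as a contiguous run inside tokens."""
    n = len(phrase_tokens)
    return any(tokens[k:k + n] == phrase_tokens
               for k in range(len(tokens) - n + 1))

def find_matches(words, sentences):
    """Return 1-based indices of sentences containing any phrase as whole tokens."""
    # Skip empty phrases, which would otherwise match every sentence.
    phrases = [w.lower().split() for w in words if w.split()]
    hits = []
    for i, sentence in enumerate(sentences, 1):
        tokens = sentence.lower().split()
        if any(contains_phrase(tokens, phrase) for phrase in phrases):
            hits.append(i)
    return hits

words = ['hi', 'going on', 'go']
transcript = "hi how are you. I am good. what's going on.".split('.')
print(find_matches(words, transcript))  # [1, 3]
```

With this version, 'going on' would not match a sentence like "the ongoing once", which a plain substring test would accept.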
Upvotes: 1
Reputation: 1672
The error occurs because 'go' is present as a substring in 'I am good'.
You can try this in the if condition:
if any(word in transcript[i].split() if len(word.split()) < 2 else word in transcript[i] for word in words):
    print(i + 1)
This will give you the desired output.
1
3
Upvotes: 0