usr_lal123

Reputation: 838

Extract space separated words from a sentence in Python

I have a list of strings, say x1 = ['esk','wild man','eskimo', 'sta','(+)-6-[amina(4-chlora)(1-metha-1h-imidol-5-yl)mhyl]-4-(3-chlora)-1-methyl-2(1h)-quinoa']. I need to extract the items of x1 that are present in a few sentences.

My sentence is "eskimo lives as a wild man in wild jungle and he stands as a guard". From this sentence I need to extract the first word, eskimo, and the fifth and sixth words, wild man, because they appear as separate words, as in x1. I should not extract "sta" from "stands", even though sta is contained in stands.

def get_name(input_str):
    prod_name = []
    for row in x1:
        if (row.strip().lower() in input_str.lower().strip()) or (len([x for x in input_str.split() if "\b"+x in row])>0):
            prod_name.append(row)
    return list(set(prod_name))

The function get_name("eskimo lives as a wild man in wild jungle and he stands as a guard") returns

[esk, eskimo,wild man,sta]

But the expected is

[eskimo,wild man]

May I know what has to be changed in the code?

Upvotes: 2

Views: 755

Answers (4)

The fourth bird

Reputation: 163362

You could join all the items from the x1 list into a single alternation (escaping each one with re.escape) and add whitespace boundaries on the left, (?<!\S), and on the right, (?!\S), so that partial matches are rejected.

Then use re.findall to get all the matches:

import re

x1 = ['esk','wild man','eskimo', 'sta','(+)-6-[amina(4-chlora)(1-metha-1h-imidol-5-yl)mhyl]-4-(3-chlora)-1-methyl-2(1h)-quinoa']
s = "eskimo lives as a wild man in wild jungle and he stands as a guard"
pattern = fr"(?<!\S)(?:{'|'.join(re.escape(x) for x in x1)})(?!\S)"

print(re.findall(pattern, s))

Output

['eskimo', 'wild man']
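
If a drop-in replacement for get_name from the question is wanted, the same pattern can be wrapped in a function. This is only a minimal sketch: the case-insensitive flag and the order-preserving de-duplication are assumptions based on the question's use of lower() and set(), and the names parameter is hypothetical.

```python
import re

x1 = ['esk', 'wild man', 'eskimo', 'sta',
      '(+)-6-[amina(4-chlora)(1-metha-1h-imidol-5-yl)mhyl]-4-(3-chlora)-1-methyl-2(1h)-quinoa']

def get_name(input_str, names=x1):
    # Escape each item so characters such as ( ) [ ] + are matched literally.
    alternation = '|'.join(re.escape(name) for name in names)
    # (?<!\S) and (?!\S) require whitespace or a string edge on either side.
    pattern = rf"(?<!\S)(?:{alternation})(?!\S)"
    # Assumption: matching should ignore case, as in the question's code.
    matches = re.findall(pattern, input_str, flags=re.IGNORECASE)
    # Lowercase the matched text and drop duplicates, keeping first-seen order.
    return list(dict.fromkeys(m.lower() for m in matches))

print(get_name("eskimo lives as a wild man in wild jungle and he stands as a guard"))
# ['eskimo', 'wild man']
```

Unlike list(set(...)), dict.fromkeys keeps the matches in the order they occur in the sentence.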


Upvotes: 0

NicoCaldo

Reputation: 1577

You can use regular expressions with word boundaries (\b):

import re

x1 = ['esk','wild man','eskimo', 'sta']

my_str = "eskimo lives as a wild man in wild jungle and he stands as a guard"
my_list = []

for words in x1:
    if re.search(r'\b' + words + r'\b', my_str):
        my_list.append(words)
print(my_list)

With the updated list, the string (+)-6-[amina(4-chlora)(1-metha-1h-imidol-5-yl)mhyl]-4-(3-chlora)-1-methyl-2(1h)-quinoa raises an error when it is used as a regular expression, so you can wrap the search in a try/except block:

for words in x1:
    try:
        if re.search(r'\b' + words + r'\b', my_str):
            my_list.append(words)
    except re.error:
        # The item is not a valid pattern; skip it (it can then never match)
        pass
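
An alternative to skipping the failing item is to escape every item with re.escape before building the pattern. Note that \b only marks a boundary between a word character and a non-word character, so it does not behave as a word edge next to characters such as "(". The sketch below therefore swaps in whitespace lookarounds; that swap is an assumption about the intended behaviour, not part of the code above.

```python
import re

x1 = ['esk', 'wild man', 'eskimo', 'sta',
      '(+)-6-[amina(4-chlora)(1-metha-1h-imidol-5-yl)mhyl]-4-(3-chlora)-1-methyl-2(1h)-quinoa']
my_str = "eskimo lives as a wild man in wild jungle and he stands as a guard"

my_list = []
for words in x1:
    # re.escape turns ( ) [ ] + into literals, so compiling never raises re.error.
    # (?<!\S) / (?!\S): the match must be surrounded by whitespace or string edges.
    pattern = r'(?<!\S)' + re.escape(words) + r'(?!\S)'
    if re.search(pattern, my_str):
        my_list.append(words)

print(my_list)  # ['wild man', 'eskimo']
```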

Upvotes: 1

Praneeth Jain

Reputation: 149

I have a slightly different approach. First, split the input sentence into words, and split each phrase you want to check for into its constituent words. Then check that every word of the phrase is present in the sentence and that the phrase itself occurs in the sentence.

x1 = ['esk','wild man','eskimo', 'sta','(+)-6-[amina(4-chlora)(1-metha-1h-imidol-5-yl)mhyl]-4-(3-chlora)-1-methyl-2(1h)-quinoa']
input_sentence = "eskimo lives as a wild man in wild jungle and he stands as a guard"
# Remove all punctuation marks from the sentence
input_sentence = input_sentence.replace('!', '').replace('.', '').replace('?', '').replace(',', '')
# Split the input sentence into its component words to check individually
input_words = input_sentence.split()

for ele in x1:
    # Split each element in x1 into words
    ele_words = ele.split()
    # Check that all words are among the input words and the phrase occurs in the sentence
    if all(word in input_words for word in ele_words) and ele in input_sentence:
        print(ele)
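
One caveat: the substring test can accept a phrase whose words occur separately while the phrase text only appears inside longer words. For example, "as a" would be reported for the sentence "as was a". A stricter variant compares the phrase's words against consecutive words of the sentence. The following is a minimal sketch of that idea; find_phrases is a hypothetical helper name, and punctuation stripping is left out.

```python
def find_phrases(phrases, sentence):
    """Return the phrases whose words appear consecutively in the sentence."""
    words = sentence.split()
    found = []
    for phrase in phrases:
        phrase_words = phrase.split()
        n = len(phrase_words)
        if n == 0:
            continue  # ignore empty phrases
        # Slide a window of n words across the sentence and compare exactly.
        if any(words[i:i + n] == phrase_words
               for i in range(len(words) - n + 1)):
            found.append(phrase)
    return found

x1 = ['esk', 'wild man', 'eskimo', 'sta']
sentence = "eskimo lives as a wild man in wild jungle and he stands as a guard"
print(find_phrases(x1, sentence))  # ['wild man', 'eskimo']
```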

Upvotes: 2

Utshaan

Reputation: 92

You could simply use str.split(" ") to get a list of all the words in the sentence, and then do the following:

s = "eskimo lives as a wild man in wild jungle and he stands as a guard"

l = s.split(" ")

x1 = ['esk','wild man','eskimo', 'sta','(+)-6-[amina(4-chlora)(1-metha-1h-imidol-5-yl)mhyl]-4-(3-chlora)-1-methyl-2(1h)-quinoa']
new_x1 = [word.split(" ") for word in x1 if " " in word] + [word for word in x1 if " " not in word]

ans = []

for x in new_x1:
    if isinstance(x, str):
        # Single word: it must be one of the sentence's words
        if x in l:
            ans.append(x)
    else:
        # Multi-word phrase: rebuild it, then check its words and the phrase itself
        temp = " ".join(x)
        if all(sub_x in l for sub_x in x) and temp in s:
            ans.append(temp)

print(ans)

Upvotes: 2
