paul

Reputation: 57

How to extract list of words out of a string with no spaces

I have a dataset in which one column contains sentences, and in some sentences the words are stuck together with no spaces. I want to extract certain words whenever they appear in a row. The words are ingredients_list=['water','milk', 'yeast', 'banana', 'sugar', 'ananas']. I use this code to extract them:

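import re
ingredients_list = ['water', 'milk', 'yeast', 'banana', 'sugar', 'ananas']
# One alternation pattern; \b requires a word boundary on both sides of each word
pat = '|'.join(r"\b{}\b".format(x) for x in ingredients_list)
ing_l = df['ingredients'].str.findall(pat, flags=re.I).str.join(' ')
# Rows with no match produce an empty string; label them "Unknown"
ing_l = ing_l.replace("", "Unknown")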

It works great, but it does not extract words from ingredients_list when one of them is stuck to another word. For example, in the string "breadmilkcoffee" it fails to extract "milk". I asked a related question about ordering the words I extract: Sort the values of first list using second list with different length in Python. But I still cannot extract all the words. Is there a solution to this problem? Thank you.

Upvotes: 0

Views: 674

Answers (1)

Anil

Reputation: 1334

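You are using the \b special character, which asserts that the pattern appears at a word boundary. In "breadmilkcoffee" there is no word boundary on either side of "milk", so the pattern never matches there.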

Removing this should allow you to match items in ingredients_list when they are not separated by a space from the rest of the string.

import re
ingredients_list = ['water', 'milk', 'yeast', 'banana', 'sugar', 'ananas']
# No \b here, so each word also matches inside a longer run of letters
pat = '|'.join(ingredients_list)
ing_l = df['ingredients'].str.findall(pat, flags=re.I).str.join(' ')
ing_l = ing_l.replace("", "Unknown")
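
To check the difference without a DataFrame, here is a minimal sketch using only the re module. The sample strings and the extract helper are made up for illustration; re.escape is added as a precaution in case an ingredient ever contains a regex metacharacter.

```python
import re

ingredients_list = ['water', 'milk', 'yeast', 'banana', 'sugar', 'ananas']

# Pattern with word boundaries, as in the question.
pat_bounded = '|'.join(r"\b{}\b".format(re.escape(x)) for x in ingredients_list)
# Pattern without word boundaries, as suggested above.
pat_plain = '|'.join(re.escape(x) for x in ingredients_list)

# Hypothetical rows standing in for df['ingredients'].
samples = ["breadmilkcoffee", "Milk and sugar", "flour"]

def extract(pattern, text):
    """Join all case-insensitive matches with spaces, or return "Unknown"."""
    found = re.findall(pattern, text, flags=re.I)
    return ' '.join(found) or "Unknown"

bounded = [extract(pat_bounded, s) for s in samples]
plain = [extract(pat_plain, s) for s in samples]

print(bounded)  # ['Unknown', 'Milk sugar', 'Unknown']
print(plain)    # ['milk', 'Milk sugar', 'Unknown']
```

Keep in mind that without \b an ingredient is also found inside unrelated longer words, for example "milk" in "milkshake".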

Upvotes: 1
