Reputation: 57
I have a dataset in which one column contains sentences, and in some of those sentences the words are stuck together. I want to extract the following words whenever they appear in a row: ingredients_list=['water','milk', 'yeast', 'banana', 'sugar', 'ananas']. I use this code to extract the words:
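import re

ingredients_list = ['water', 'milk', 'yeast', 'banana', 'sugar', 'ananas']
# \b on both sides only matches each ingredient as a standalone word
pat = '|'.join(r"\b{}\b".format(x) for x in ingredients_list)
ing_l = df['ingredients'].str.findall(pat, flags=re.I).str.join(' ')
# rows with no match end up as an empty string
ing_l = ing_l.replace("", "Unknown")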
It works well, but it doesn't extract words from ingredients_list when one of them is stuck to another word. For example, in the string "breadmilkcoffee" it fails to extract "milk" from the joined words. I asked a related question about ordering the words I extract: Sort the values of first list using second list with different length in Python. But I still can't extract all the words. Do you have any solution to this problem? Thank you very much.
Upvotes: 0
Views: 674
Reputation: 1334
You are using the \b special character, which asserts that the pattern appears at a word boundary.
Removing it should allow you to match items in ingredients_list even when they are not separated by a space from the rest of the string.
import re

ingredients_list = ['water', 'milk', 'yeast', 'banana', 'sugar', 'ananas']
# no \b, so an ingredient is matched anywhere inside a longer string
pat = '|'.join(re.escape(x) for x in ingredients_list)
ing_l = df['ingredients'].str.findall(pat, flags=re.I).str.join(' ')
ing_l = ing_l.replace("", "Unknown")
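To see the difference without needing the original DataFrame, here is a minimal sketch using only the re module. The sample strings and the helper name extract_ingredients are made up for illustration, and sorting the alternatives longest-first is an extra precaution that is not part of the code above.

```python
import re

ingredients_list = ['water', 'milk', 'yeast', 'banana', 'sugar', 'ananas']

# Pattern from the question: every ingredient must sit between word boundaries.
bounded = re.compile(
    '|'.join(r"\b{}\b".format(re.escape(x)) for x in ingredients_list),
    flags=re.I,
)

# Pattern without \b: an ingredient can match anywhere inside a longer string.
# Longer words are listed first so a shorter alternative cannot shadow them.
unbounded = re.compile(
    '|'.join(re.escape(x) for x in sorted(ingredients_list, key=len, reverse=True)),
    flags=re.I,
)

def extract_ingredients(text):
    """Hypothetical helper: join all matches with spaces, or return 'Unknown'."""
    found = unbounded.findall(text)
    return ' '.join(found) if found else 'Unknown'

# Made-up sample rows standing in for df['ingredients'].
samples = ['breadmilkcoffee', 'Water and SUGAR', 'plain flour']
results = [extract_ingredients(s) for s in samples]

print(bounded.findall('breadmilkcoffee'))  # the \b version finds nothing here
print(results)
```

Keep in mind that without \b the pattern also matches inside unrelated longer words (for instance "milk" inside "milkshake"), and that findall returns non-overlapping matches, so overlapping candidates such as "banana" and "ananas" cannot both be reported for the same stretch of text.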
Upvotes: 1