IronWaffleMan
IronWaffleMan

Reputation: 2692

Find (possibly multi-word) phrase inside sentence in Python

I'm trying to find keywords within a sentence, where the keywords are usually single words, but can be multi-word combos (like "cost in euros"). So if I have a sentence like cost in euros of bacon it would find cost in euros in that sentence and return true.

For this, I was using this code:

if any(phrase in line for phrase in keyword['aliases']:

where line is the input and aliases is an array of phrases that match a keyword (like for cost in euros, it's ['cost in euros', 'euros', 'euro cost']).

However, I noticed that it was also triggering on word parts. For example, I had a match phrase of y and a sentence of trippy cake. I'd not expect this to return true, but it does, since it apparently finds the y in trippy. How do I get this to only check whole words? Originally I was doing this keyword search with a list of words (essentially doing line.split() and checking those), but that doesn't work for multi-word keyword aliases.

Upvotes: 2

Views: 509

Answers (1)

Chris Larson
Chris Larson

Reputation: 1714

This should accomplish what you're looking for:

import re

aliases = [
    'cost.',
    '.cost',
    '.cost.',
    'cost in euros of bacon',
    'rocking euros today',
    'there is a cost inherent to bacon',
    'europe has cost in place',
    'there is a cost.',
    'I was accosted.',
    'dealing with euro costing is painful']
phrases = ['cost in euros', 'euros', 'euro cost', 'cost']

matched = list(set([
    alias
    for alias in aliases
    for phrase in phrases
    if re.search(r'\b{}\b'.format(phrase), alias)
    ]))

print(matched)

Output:

['there is a cost inherent to bacon', '.cost.', 'rocking euros today', 'there is a cost.', 'cost in euros of bacon', 'europe has cost in place', 'cost.', '.cost']

Basically, we're grabbing all matches, using pythons re module as our test, including cases where multiple phrases occur in a given alias, using a compound list comprehension, then using set() to trim duplicates from the list, then using list() to coerce the set back into a list.

Refs:

Lists: https://docs.python.org/3/tutorial/datastructures.html#more-on-lists

List comprehensions: https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions

Sets: https://docs.python.org/3/tutorial/datastructures.html#sets

re (or regex): https://docs.python.org/3/library/re.html#module-re

Upvotes: 2

Related Questions