simplyharsh

Reputation: 36373

Can python regex negate a list of words?

I have to match all the alphanumeric words from a text.

>>> import re
>>> text = "hello world!! how are you?"
>>> final_list = re.findall(r"[a-zA-Z0-9]+", text)
>>> final_list
['hello', 'world', 'how', 'are', 'you']
>>>

This is fine, but I also have a few words to negate, i.e. words that shouldn't be in my final list.

>>> negate_words = ['world', 'other', 'words']

A bad way to do it:

>>> negate_str = '|'.join(negate_words)
>>> list(filter(lambda x: not re.match(negate_str, x), final_list))
['hello', 'how', 'are', 'you']

But I could save a loop if my very first regex pattern could be changed to exclude those words. I found negation of characters, but I have words to negate. I also found regex lookbehind in other questions, but that doesn't help either.

Can it be done using Python's re?

Update

My text can span a few hundred lines. Also, the list of negate_words can be lengthy too.

Considering this, is using regex for such a task correct in the first place? Any suggestions?

Upvotes: 0

Views: 2982

Answers (3)

eyquem

Reputation: 27575

Don't ask too much of regex needlessly.
Instead, think of generators.

import re

unwanted = ('world', 'other', 'words')

text = "hello world!! how are you?"

gen = (m.group() for m in re.finditer(r"[a-zA-Z0-9]+", text))
li = [w for w in gen if w not in unwanted]

A generator can also be created instead of li.
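
A minimal sketch of that generator variant, reusing the names from the snippet above. The function name iter_wanted is introduced here purely for illustration:

```python
import re

unwanted = ('world', 'other', 'words')
text = "hello world!! how are you?"

def iter_wanted(text, unwanted):
    """Lazily yield each alphanumeric word that is not in `unwanted`."""
    for m in re.finditer(r"[a-zA-Z0-9]+", text):
        word = m.group()
        if word not in unwanted:
            yield word

# Nothing is scanned until the generator is consumed.
for word in iter_wanted(text, unwanted):
    print(word)
```

If the list of unwanted words is long, passing a set instead of a tuple should make each membership test cheaper.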

Upvotes: -1

jcollado

Reputation: 40394

It may be worth trying pyparsing for this:

>>> from pyparsing import *

>>> negate_words = ['world', 'other', 'words']
>>> parser = OneOrMore(Suppress(oneOf(negate_words)) ^ Word(alphanums)).ignore(CharsNotIn(alphanums))
>>> parser.parseString('hello world!! how are you?').asList()
['hello', 'how', 'are', 'you']

Note that oneOf(negate_words) must be before Word(alphanums) to make sure that it matches earlier.

Edit: Just for the fun of it, I repeated the exercise using lepl (also an interesting parsing library):

>>> from lepl import *

>>> negate_words = ['world', 'other', 'words']
>>> parser = OneOrMore(~Or(*negate_words) | Word(Letter() | Digit()) | ~Any())
>>> parser.parse('hello world!! how are you?')
['hello', 'how', 'are', 'you']

Upvotes: 1

Raymond Hettinger

Reputation: 226376

I don't think there is a clean way to do this using regular expressions. The closest I could find was a bit ugly and not exactly what you wanted:

>>> re.findall(r"\b(?:world|other|words)|([a-zA-Z0-9]+)\b", text)
['hello', '', 'how', 'are', 'you']
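
For comparison, a negative lookahead combined with word boundaries gets closer to a single-pattern solution. This is a minimal sketch, assuming the negated words should be excluded only as whole words and that they are escaped with re.escape; the names alternatives, pattern, and matches are introduced here for illustration:

```python
import re

text = "hello world!! how are you?"
negate_words = ['world', 'other', 'words']

# Escape each word so any regex metacharacters are treated literally.
alternatives = "|".join(re.escape(w) for w in negate_words)

# (?!...) rejects a position where a whole negated word starts;
# the inner \b keeps 'world' from also rejecting 'worldwide'.
pattern = re.compile(r"\b(?!(?:%s)\b)[a-zA-Z0-9]+" % alternatives)

matches = pattern.findall(text)
print(matches)  # ['hello', 'how', 'are', 'you']
```

Note that \b treats the underscore as a word character, so a word that directly follows an underscore is skipped by this pattern, unlike with the plain [a-zA-Z0-9]+ pattern. The alternation also grows with the length of the word list.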

Why not use Python's sets instead? They are very fast:

>>> list(set(final_list) - set(negate_words))
['hello', 'how', 'are', 'you']

If order is important, see the reply from @glglgl below. His list comprehension version is very readable. Here's a fast but less readable equivalent using itertools:

>>> import itertools
>>> negate_words_set = set(negate_words)
>>> list(itertools.filterfalse(negate_words_set.__contains__, final_list))  # itertools.ifilterfalse on Python 2
['hello', 'how', 'are', 'you']
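
An order-preserving list comprehension over the same set might look like the following. This is a sketch of the general idea, not a copy of the reply referenced above; the name filtered is introduced here for illustration:

```python
final_list = ['hello', 'world', 'how', 'are', 'you']
negate_words = ['world', 'other', 'words']

# Build the set once so each membership test is a hash lookup.
negate_words_set = set(negate_words)

# Keep the words in their original order, dropping the negated ones.
filtered = [word for word in final_list if word not in negate_words_set]
print(filtered)  # ['hello', 'how', 'are', 'you']
```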

Another alternative is to build up the word list in a single pass using re.finditer:

>>> negate_words_set = set(negate_words)
>>> result = []
>>> for mo in re.finditer(r"[a-zA-Z0-9]+", text):
...     word = mo.group()
...     if word not in negate_words_set:
...         result.append(word)
...
>>> result
['hello', 'how', 'are', 'you']

Upvotes: 6
