Python Regular Expression OR not matching

Question

For some reason,

m = re.search('(<[^pib(strong)(br)].*?>|)', '')

matches the string, but

m = re.search('()', '')

does not. I am trying to match all tags that are not <p>, <i>, <b>, <strong>, <br>, and so on. Am I misunderstanding something about how '|' works?

Gabi Purcaru · Accepted Answer

You're doing it wrong. First of all, characters between [] are treated differently: a character class matches exactly one character from a set, so [ab] will match either a or b. In your case, [^pib(strong)] will match any single character that is not a p, an i, a b, a (, an s, a t, etc. (note the negation from ^); it knows nothing about tag names like strong. The fact that your first regex matches is merely a coincidence.
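To see what that character class really does, here is a small sketch. It uses the first half of the pattern from the question (without the alternation); the sample tags and the `first_match` helper are my own, picked only for illustration.

```python
import re

# The class from the question is a set of single characters,
# equivalent to [^()bginoprst]; it is not a list of tag names.
pattern = re.compile(r'<[^pib(strong)(br)].*?>')

def first_match(text):
    """Return the matched text, or None if the pattern does not match."""
    m = pattern.search(text)
    return m.group(0) if m else None

print(first_match('<div>'))   # <div>  ('d' is not in the set)
print(first_match('<span>'))  # None   ('s' is in the set, via "strong")
print(first_match('<b>'))     # None   ('b' is in the set)
print(first_match('</b>'))    # </b>   ('/' is not in the set)
```

So `<span>` is rejected even though it is not one of the listed tags, and the closing `</b>` is accepted even though `b` is listed.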

Also, you shouldn't be parsing HTML/XML with regex. Instead, use a proper parsing library, like lxml or BeautifulSoup.

Here's a simple example with lxml:

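from lxml import html  # .cssselect() also needs the cssselect package

dom = html.fromstring(your_code)  # your_code: the HTML string to check
# every element in the tree, minus the elements that are allowed
illegal = set(dom.cssselect('*')) - set(dom.cssselect('p,b'))
for tag in illegal:
    do_something_with(tag)  # placeholder for your own handling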

(This is a small, probably sub-optimal example; it only serves to show how easy it is to use such a library. Also, note that the library may wrap the code in an extra parent element, so you should take that into consideration.)
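If you'd rather not install anything, roughly the same idea can be sketched with the standard library's `html.parser` instead of lxml. This is a swapped-in technique, not the approach shown above. The `ALLOWED` set is an assumption based on the tags in the question, and `DisallowedTagFinder` and `find_disallowed` are hypothetical names used only for this example.

```python
from html.parser import HTMLParser

# Assumed whitelist, taken from the tags mentioned in the question.
ALLOWED = {"p", "i", "b", "strong", "br"}

class DisallowedTagFinder(HTMLParser):
    """Collects the names of start tags that are not in ALLOWED."""

    def __init__(self):
        super().__init__()
        self.disallowed = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this method.
        if tag not in ALLOWED:
            self.disallowed.append(tag)

def find_disallowed(markup):
    """Return the disallowed tag names in document order."""
    finder = DisallowedTagFinder()
    finder.feed(markup)
    finder.close()
    return finder.disallowed

sample = '<p>Hello <span class="x">world</span><br><DIV><b>!</b></DIV></p>'
print(find_disallowed(sample))  # ['span', 'div']
```

This only reports tag names. If you need to modify or remove the offending elements, a tree-based library like lxml or BeautifulSoup is probably the more comfortable choice.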
