Sarah Nguyen

Reputation: 151

Python Regular Expression OR not matching

For some reason,

m = re.search('(<[^pib(strong)(br)].*?>|</[^pib(strong)]>)', '</b>')

matches the string, but

m = re.search('(</[^pib(strong)]>)', '</b>')

does not. I am trying to match all tags that are not

<p>, <b>, </p>, </b>

and so on. Am I misunderstanding something about how '|' works?

Upvotes: 0

Views: 288

Answers (1)

Gabi Purcaru

Reputation: 31524

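You're misusing character classes. Characters between [] are matched individually: [ab] matches either a or b, and parentheses have no grouping meaning there. So in your case [^pib(strong)] matches any single character that is not p, i, b, (, s, t, r, o, n, g, or ) (note the negation from ^). That is why the second regex fails on '</b>': b is one of the excluded characters. The first regex matching is merely a coincidence: its first alternative, <[^pib(strong)(br)].*?>, accepts the / as the "not excluded" character, and then .*?> consumes b>.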
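To make the character-class behaviour concrete, here is a small sketch that runs the two patterns from the question and then shows one possible regex that does express "any tag whose name is not in the list", using a negative lookahead. The names `ALLOWED` and `not_allowed` are made up for illustration, and the pattern assumes simple, well-formed tags; it is not a general HTML matcher.

```python
import re

# Inside [...], every character is its own set member; parentheses do not group.
char_class = re.compile(r'</[^pib(strong)]>')
print(char_class.search('</b>'))  # None: 'b' is one of the excluded characters
print(char_class.search('</s>'))  # None: 's' (from "strong") is excluded too
print(char_class.search('</a>'))  # a match object: 'a' is not in the set

# The first pattern from the question matches '</b>' only because '/' is
# not an excluded character, so the first alternative swallows the whole tag.
first = re.compile(r'(<[^pib(strong)(br)].*?>|</[^pib(strong)]>)')
print(first.search('</b>').group())

# Hypothetical alternative: a negative lookahead rejects whole tag names.
ALLOWED = ('p', 'i', 'b', 'strong', 'br')
not_allowed = re.compile(r'</?(?!(?:%s)\b)[a-zA-Z][^>]*>' % '|'.join(ALLOWED))

print(not_allowed.findall('<p>a<div><b>x</b></div><br></p>'))
```

The `\b` after the alternation keeps `<body>` from being treated as `<b>`, but it is a rough check; names such as `<p-custom>` would still be skipped.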

Also, you shouldn't be parsing HTML/XML with regex. Instead, use a proper HTML/XML parsing library, like lxml or BeautifulSoup.

Here's a simple example with lxml:

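from lxml import html  # cssselect() also needs the separate "cssselect" package

dom = html.fromstring(your_code)  # your_code: the HTML string to check
# every element in the tree, minus the allowed <p> and <b> elements
illegal = set(dom.cssselect('*')) - set(dom.cssselect('p,b'))
for tag in illegal:
    do_something_with(tag)  # placeholder for your own handling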

(This is a small, probably sub-optimal example; it just shows how easy such a library is to use. Also note that when given a fragment, the library may wrap it in an extra element such as a <p>, so take that into consideration.)
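If installing lxml is not an option, roughly the same idea can be sketched with the standard library's html.parser instead. This swaps in a different parser than the one used above; `DisallowedTagFinder`, `find_disallowed`, and `ALLOWED_TAGS` are hypothetical names, and the whitelist simply mirrors the 'p,b' selector from the lxml example.

```python
from html.parser import HTMLParser

ALLOWED_TAGS = {'p', 'b'}  # assumption: same whitelist as the lxml example


class DisallowedTagFinder(HTMLParser):
    """Collects the names of opening tags that are not in ALLOWED_TAGS."""

    def __init__(self):
        super().__init__()
        self.disallowed = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names before calling this method
        if tag not in ALLOWED_TAGS:
            self.disallowed.append(tag)


def find_disallowed(markup):
    """Return the disallowed tag names in document order."""
    finder = DisallowedTagFinder()
    finder.feed(markup)
    finder.close()
    return finder.disallowed


print(find_disallowed('<p>Hi <b>there</b> <i>you</i></p>'))
```

Because html.parser is event-based and builds no tree, it does not add wrapper elements, but it also gives you no DOM to modify; for rewriting the markup, lxml or BeautifulSoup is likely the better fit.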

Upvotes: 2
