Reputation: 151
For some reason,
m = re.search('(<[^pib(strong)(br)].*?>|</[^pib(strong)]>)', '</b>')
matches the string, but
m = re.search('(</[^pib(strong)]>)', '</b>')
does not. I am trying to match all tags that are not <p>, <b>, </p>, </b> and so on. Am I misunderstanding something about how '|' works?
Upvotes: 0
Views: 288
Reputation: 31524
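You're doing it wrong. First of all, characters between [] form a character class, which matches exactly one character from the set: [ab] will match either a or b. So in your case [^pib(strong)] will match any single character that is not a p, an i, a b, a (, an s, and so on (note the negation from ^). Parentheses inside a class are literal characters, not groups. That is why your second regex fails on '</b>': b is one of the excluded characters. Your first regex matches merely by coincidence: in its first alternative, [^pib(strong)(br)] matches the /, and .*? then consumes the b before the closing >.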
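To make the character-class behaviour concrete, here is a small sketch using the two patterns from the question. The last pattern is only one possible way to exclude whole tag names (a negative lookahead over an alternation); the list of allowed names is an assumption read off the question's pattern, and it is meant to illustrate the regex syntax, not to be a robust HTML matcher.

```python
import re

# A character class is a set of single characters; "(" and ")" inside it are literal.
char_class = re.compile(r'</[^pib(strong)]>')
print(char_class.search('</b>'))  # None: 'b' is in the excluded set
print(char_class.search('</a>'))  # a match: 'a' is not in the set
print(char_class.search('</s>'))  # None: 's' is excluded because of "(strong)"

# The first pattern from the question matches '</b>' only because '/' is not
# in the class, and the lazy '.*?' then consumes 'b' before '>'.
first = re.compile(r'(<[^pib(strong)(br)].*?>|</[^pib(strong)]>)')
print(first.search('</b>').group(0))  # </b>

# Assumed intent: skip the tags p, i, b, strong and br, match everything else.
# (?!...) rejects those names as whole words; \b stops 'b' from excluding 'body'.
not_allowed = re.compile(r'</?(?!(?:p|i|b|strong|br)\b)[a-zA-Z][^>]*>')
print(not_allowed.findall('<p>a <span>b</span> <b>c</b><br/></p>'))
# ['<span>', '</span>']
```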
Also, you shouldn't be parsing HTML/XML with regex. Instead, use a proper parsing library, like lxml or BeautifulSoup.
Here's a simple example with lxml:
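from lxml import html
# your_code is a placeholder for the HTML string you want to inspect
dom = html.fromstring(your_code)
# cssselect() needs the separate cssselect package to be installed
# every element in the tree, minus the allowed <p> and <b> elements
illegal = set(dom.cssselect('*')) - set(dom.cssselect('p,b'))
for tag in illegal:
    do_something_with(tag)  # placeholder for your own handling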
(This is a small, probably sub-optimal example; it just shows how easy such a library is to use. Also note that, depending on the input, the library may wrap a fragment in an extra element such as a <p>, so you should take that into consideration.)
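If installing lxml is not an option, the same idea can be sketched with the standard library's html.parser instead. This is a different parser from the one used above, not a drop-in replacement. The names ALLOWED, DisallowedTagCollector and find_disallowed are made up for this illustration, and the whitelist simply mirrors the 'p,b' selector.

```python
from html.parser import HTMLParser

# Assumption: same whitelist as the 'p,b' selector in the lxml example.
ALLOWED = {"p", "b"}


class DisallowedTagCollector(HTMLParser):
    """Collects the names of opening tags that are not in ALLOWED."""

    def __init__(self):
        super().__init__()
        self.disallowed = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this hook.
        if tag not in ALLOWED:
            self.disallowed.append(tag)


def find_disallowed(markup):
    """Return the disallowed tag names in document order."""
    collector = DisallowedTagCollector()
    collector.feed(markup)
    collector.close()
    return collector.disallowed


print(find_disallowed('<p>Hi <i>there</i> <b>you</b><br/></p>'))
# ['i', 'br']
```

Unlike lxml, this gives you tag names rather than element objects, so it fits best when you only need to detect or report the offending tags.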
Upvotes: 2