Andrius

Reputation: 21188

Python - regular expressions - find every word except in tags

How can I find all words except the ones inside tags using the re module?

I know how to search for something, but how do I do the opposite? That is, I want to match every word except the tags themselves and everything inside them.

So far I managed this:

import re
with open(filename, 'r') as f:
    data = re.findall(r"<.+?>", f.read())

Well, it finds everything inside <> tags (along with the tags themselves), but how do I make it find every word that is not inside those tags? I tried putting ^ at the start of the pattern inside [], but then symbols such as . are treated literally and lose their special meaning. I also managed to solve this by splitting the string on '''\= <>"''', then checking the whole string for words that are inside <> tags (like align, right, td, etc.) and appending the words that are not inside <> tags to another list. But that is a rather ugly solution.

Is there a simple way to search for every word except anything inside <> and the tags themselves? For example, given the string 'hello 123 <b>Bold</b> <p>end</p>', re.findall would return:

['hello', '123', 'Bold', 'end']

Upvotes: 0

Views: 1061

Answers (4)

schesis

Reputation: 59238

If you want to avoid using a regular expression, BeautifulSoup makes it very easy to get just the text from an HTML document:

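# BeautifulSoup 4 (package "bs4"); the old "from BeautifulSoup import BeautifulSoup" is Python 2 only
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_string, "html.parser")
text = "".join(soup.find_all(string=True))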

From there, you can get the list of words with split:

words = text.split()

Upvotes: 2

Ωmega

Reputation: 43703

Using regex for this kind of task is not the best idea, as you cannot make it work for every case.

One solution that should catch most such words is the regex pattern

\b\w+\b(?![^<]*>)
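As a rough sketch of how this pattern could be used from Python, the snippet below passes it to re.findall on the sample string from the question. The names WORD_OUTSIDE_TAGS and find_words are made up for this illustration. The negative lookahead rejects any word that is followed by a > before the next <, which is how it guesses that the word sits inside a tag.

```python
import re

# Pattern from the answer: a whole word that is NOT followed by
# "some non-'<' characters and then '>'", i.e. not inside <...>.
WORD_OUTSIDE_TAGS = re.compile(r"\b\w+\b(?![^<]*>)")


def find_words(text):
    """Return the words that appear outside <...> tags (hypothetical helper)."""
    return WORD_OUTSIDE_TAGS.findall(text)


sample = 'hello 123 <b>Bold</b> <p>end</p>'
print(find_words(sample))  # ['hello', '123', 'Bold', 'end']
```

Note that a stray > in ordinary text (for example "3 > 2") would cause the words before it to be skipped, which is one of the cases the answer warns about.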

Upvotes: 2

Billy Moon

Reputation: 58619

Strip out all the tags (using your original regex), then match words.

The only weakness is when the text contains < characters that are not tag delimiters, or when the HTML is not well formed. In that case, it is better to use an HTML parser.
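A minimal sketch of the two steps described above, assuming the "original regex" means the <.+?> pattern from the question. The function name words_after_stripping is hypothetical, and replacing each tag with a space (rather than an empty string) is a choice made here so that words separated only by a tag do not get glued together.

```python
import re

TAG = re.compile(r"<.+?>")   # the pattern from the question
WORD = re.compile(r"\w+")


def words_after_stripping(text):
    """Step 1: replace every tag with a space. Step 2: match the words."""
    without_tags = TAG.sub(" ", text)
    return WORD.findall(without_tags)


print(words_after_stripping('hello 123 <b>Bold</b> <p>end</p>'))
# ['hello', '123', 'Bold', 'end']

# The weakness mentioned above: a literal "<" is mistaken for a tag start,
# so "< 2 and 3 >" is removed as if it were a tag.
print(words_after_stripping('1 < 2 and 3 > 2'))
# ['1', '2']
```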

Upvotes: 0

khachik

Reputation: 28703

Something like re.compile(r'<[^>]+>').sub('', string).split() should do the trick.

You might want to read this post about processing context-free languages using regular expressions.
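For input where the regex approach becomes fragile, one possible alternative is the standard library's html.parser module, which needs no third-party package. This is only a sketch: the class TextCollector and the function words_outside_tags are names invented for this example, and the contents of script and style elements are collected as ordinary text here.

```python
import re
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects the text found between tags (hypothetical helper class)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called by HTMLParser for every run of text outside tags.
        self.chunks.append(data)


def words_outside_tags(html):
    collector = TextCollector()
    collector.feed(html)
    collector.close()
    # Join with spaces so text from adjacent elements stays separated.
    return re.findall(r"\w+", " ".join(collector.chunks))


print(words_outside_tags('hello 123 <b>Bold</b> <p>end</p>'))
# ['hello', '123', 'Bold', 'end']
```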

Upvotes: 1
