Reputation: 21188
How to find all words except the ones in tags using the re module?
I know how to find something, but how do I do it the opposite way? I write a pattern to search for, but actually I want to find every word except the tags themselves and everything inside them.
So far I managed this:
import re
with open(filename) as f:
    data = re.findall(r"<.+?>", f.read())
This finds the tags themselves (everything inside <>), but how do I make it find every word except what is inside those tags?
I tried using ^ at the start of the pattern inside [], but then symbols such as . are treated literally, without their special meaning.
I also managed to solve this by splitting the string on '''\= <>"''', then checking the whole string for words that are inside <> tags (like align, right, td, etc.) and appending the words that are not inside <> tags to another list. But that is a rather ugly solution.
Is there some simple way to search for every word except the <> tags themselves and anything inside them?
So, let's say, the string 'hello 123 <b>Bold</b> <p>end</p>' passed to re.findall would return:
['hello', '123', 'Bold', 'end']
Upvotes: 0
Views: 1061
Reputation: 59238
If you want to avoid using a regular expression, BeautifulSoup makes it very easy to get just the text from an HTML document:
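# BeautifulSoup 4 (the bs4 package); with the older BeautifulSoup 3 the import was `from BeautifulSoup import BeautifulSoup`
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_string, "html.parser")
text = "".join(soup.find_all(string=True))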
From there, you can get the list of words with split:
words = text.split()
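If installing BeautifulSoup is not an option, the same parser-based idea can be sketched with the standard library's html.parser module. This is only a minimal sketch, not part of the answer above: the names TextCollector and words_outside_tags are made up for illustration.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: collects only the text found between tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for text nodes only; tag names and attributes never arrive here.
        self.chunks.append(data)


def words_outside_tags(html_string):
    collector = TextCollector()
    collector.feed(html_string)
    collector.close()  # flush any text still buffered by the parser
    return "".join(collector.chunks).split()


print(words_outside_tags('hello 123 <b>Bold</b> <p>end</p>'))
```

As with the BeautifulSoup version, words are split on whitespace, so punctuation attached to a word stays attached to it.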
Upvotes: 2
Reputation: 43703
Using a regex for this kind of task is not the best idea, as you cannot make it work for every case.
One solution that should catch most such words is the regex pattern
\b\w+\b(?![^<]*>)
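A quick way to try this pattern with re.findall might look like the sketch below. The negative lookahead rejects any word that is followed by a > with no < in between, which is what a word inside a tag looks like. The function name find_words is just an illustrative choice.

```python
import re

# A word that is NOT followed by "...>" without an intervening "<".
WORD_OUTSIDE_TAGS = re.compile(r"\b\w+\b(?![^<]*>)")


def find_words(text):
    return WORD_OUTSIDE_TAGS.findall(text)


print(find_words('hello 123 <b>Bold</b> <p>end</p>'))
# ['hello', '123', 'Bold', 'end']

# Attribute names and values are skipped too.
print(find_words('<td align="right">x</td>'))
# ['x']

# One case it gets wrong: a stray ">" in ordinary text hides the words before it.
print(find_words('a > b'))
# ['b']
```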
Upvotes: 2
Reputation: 58619
Strip out all the tags (using your original regex), then match words.
The only weakness is if there are <s in the strings other than as tag delimiters, or the HTML is not well formed. In that case, it is better to use an HTML parser.
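As a rough sketch of this two-step approach, assuming the question's <.+?> pattern for tags and \w+ as the definition of a word (the function name words_after_stripping is made up here):

```python
import re

TAG = re.compile(r"<.+?>")   # the tag pattern from the question
WORD = re.compile(r"\w+")    # assumed definition of a "word"


def words_after_stripping(text):
    # Replace each tag with a space so neighbouring words do not get glued together.
    without_tags = TAG.sub(" ", text)
    return WORD.findall(without_tags)


print(words_after_stripping('hello 123 <b>Bold</b> <p>end</p>'))
# ['hello', '123', 'Bold', 'end']

# The weakness mentioned above: a "<" that is not a tag delimiter swallows text.
print(words_after_stripping('1 < 2 and 3 > 2'))
# ['1', '2']
```

Replacing tags with a space rather than an empty string is a deliberate choice: with an empty string, 'one<br>two' would collapse into the single word 'onetwo'.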
Upvotes: 0