Zeynel

Reputation: 13515

How to eliminate HTML tags?

I am getting the first paragraph from pages and trying to extract words suitable to be tags or keywords. Some paragraphs contain links and other markup, and I want to remove the HTML tags.

For instance, if the text is

A <b>hex triplet</b> is a six-digit, three-<a href="/wiki/Byte"
enter code heretitle="Byte">byte</a> ...

I want to remove

<b></b><a href="/wiki/Byte" title="Byte"></a>

to end up with this

A hex triplet is a six-digit, three-byte ...

A regex like this does not work:

>>> import re
>>> text = """A <b>hex triplet</b> is a six-digit, three-<a href="/wiki/Byte"
    enter code heretitle="Byte">byte</a> ..."""
>>> f = re.findall(r'<.+>', text)
>>> f
['<b>hex triplet</b>', '</a>']
>>>

What is the best way to do this?

I found several similar questions, but I don't think any of them solves this particular problem.

Update with an example of BeautifulSoup's extract (extract deletes the tag, including its text, and must be run for each tag separately):

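>>> from BeautifulSoup import BeautifulSoup  # assuming BeautifulSoup 3; with bs4 use: from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(text)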
>>> [s.extract() for s in soup('b')]
[<b>hex triplet</b>]
>>> soup
A  is a six-digit, three-<a href="/wiki/Byte" enter code heretitle="Byte">byte</a> ...
>>> [s.extract() for s in soup('a')]
[<a href="/wiki/Byte" enter code heretitle="Byte">byte</a>]
>>> soup
A  is a six-digit, three- ...
>>> 

Update

For people with the same question: as mentioned by Brendan Long, this answer using HTMLParser works best.
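
For reference, here is a minimal sketch of the parser-based approach, assuming Python 3's standard html.parser module. The names TagStripper and strip_tags are chosen here for illustration and are not taken from the linked answer. The idea is to ignore every tag and collect only the text nodes.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collects text nodes and ignores all tags (illustrative name)."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for every run of text between tags
        self.chunks.append(data)

    def get_text(self):
        return "".join(self.chunks)


def strip_tags(html):
    """Return the text content of an HTML fragment with all tags removed."""
    stripper = TagStripper()
    stripper.feed(html)
    stripper.close()  # flush any text the parser is still buffering
    return stripper.get_text()


text = '''A <b>hex triplet</b> is a six-digit, three-<a href="/wiki/Byte"
    title="Byte">byte</a> ...'''
print(strip_tags(text))  # A hex triplet is a six-digit, three-byte ...
```

Unlike a regex, the parser copes with tags that span several lines. Note that in this sketch the contents of script and style elements also arrive through handle_data, so they would need extra handling if they should be dropped.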

Upvotes: 2

Views: 274

Answers (3)

varunl

Reputation: 20229

Beautiful Soup is the answer to your problem. Try it out, it's pretty awesome!

HTML parsing becomes much easier once you use it.

>>> text = """A <b>hex triplet</b> is a six-digit, three-<a href="/wiki/Byte"
... enter code heretitle="Byte">byte</a> ..."""
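>>> from BeautifulSoup import BeautifulSoup  # assuming BeautifulSoup 3; with bs4 use: from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(text)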
>>> ''.join(soup.findAll(text=True))
u'A hex triplet is a six-digit, three-byte ...'

If all the text you want is enclosed in some outer tag, such as <body> ... </body> or <div id="X"> ... </div>, you can extract text selectively from just that tag. The example below assumes that everything you want is inside the <body> tag.

(Look at the documentation and examples and you will find many ways of parsing the DOM.)

>>> text = """<body>A <b>hex triplet</b> is a six-digit, 
... three-<a href="/wiki/Byte"
... enter code heretitle="Byte">byte</a>
... </body>"""
>>> soup = BeautifulSoup(text)
>>> ''.join(soup.body.findAll(text=True))
u'A hex triplet is a six-digit, \nthree-byte\n'

Upvotes: 3

user557597

These are just the basic elements needed to strip tags. Some pieces are left
out: the \w's below stand in for qualified Unicode tag names (prefix and body),
which would need a join() statement to build the full subexpression. The virtue
of parsing HTML/XML with a regex is that it won't fail on the first ill-formed
instance, which makes it well suited to fixing such markup. The vice is that
it's slow as sh*t, especially with Unicode.

Unfortunately, stripping tags destroys information, since markup by definition formats content.

Try this on a big web page. It should be translatable into Python.

$rx_expanded = '
<
(?:
    (?:
       (?:
           (?:script|style) \s*
         | (?:script|style) \s+ (?:".*?"|\'.*?\'|[^>]*?)+\s*
       )> .*? </(?:script|style)\s*
    )
  |
    (?:
        /?\w+\s*/?
      | \w+\s+ (?:".*?"|\'.*?\'|[^>]*?)+\s*/?
      | !(?:DOCTYPE.*?|--.*?--)
    )
)
>
';

$html =~ s/$rx_expanded/[was]/xsg;
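
As a rough sketch of that translation, the Perl pattern above can be carried over to Python almost verbatim, assuming Python 3.5 or later. re.VERBOSE stands in for Perl's /x flag and re.DOTALL for /s. The names RX_EXPANDED and strip_markup are made up for this illustration, and the replacement defaults to an empty string rather than '[was]'.

```python
import re

# Transliteration of the Perl pattern; whitespace in the pattern is ignored
# because of re.VERBOSE, and "." matches newlines because of re.DOTALL.
RX_EXPANDED = re.compile(r"""
<
(?:
    (?:
       (?:
           (?:script|style) \s*
         | (?:script|style) \s+ (?:".*?"|'.*?'|[^>]*?)+ \s*
       )> .*? </(?:script|style) \s*
    )
  |
    (?:
        /?\w+\s*/?
      | \w+\s+ (?:".*?"|'.*?'|[^>]*?)+ \s*/?
      | !(?:DOCTYPE.*?|--.*?--)
    )
)
>
""", re.VERBOSE | re.DOTALL)


def strip_markup(html, replacement=""):
    """Replace tags, comments, DOCTYPEs and whole script/style blocks."""
    return RX_EXPANDED.sub(replacement, html)


page = ('<!DOCTYPE html><script type="text/javascript">var x = "<b>";</script>'
        '<p class="x">Hi <br/> there</p><!-- c -->')
print(repr(strip_markup(page)))  # 'Hi  there'
```

Passing '[was]' as the replacement reproduces the Perl substitution. The nested quantifiers can backtrack heavily on input with an unclosed tag, which is consistent with the speed caveat above, so treat this as a demonstration rather than something tuned for large pages.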

Upvotes: 0

John Kugelman

Reputation: 361547

The + quantifier is greedy, meaning it will find the longest possible match. Add a ? to force it to find the shortest possible match:

>>> re.findall(r'<.+?>', text)
['<b>', '</b>', '</a>']

Another way to write the regex is to explicitly exclude right angle brackets inside a tag, using [^>] instead of the dot (.):

>>> re.findall(r'<[^>]+>', text)
['<b>', '</b>', '<a href="/wiki/Byte"\n    enter code heretitle="Byte">', '</a>']

An advantage of this approach is that it will also match newlines (\n). You can get the same behavior with . if you add the re.DOTALL flag.

>>> re.findall(r'<.+?>', text, re.DOTALL)
['<b>', '</b>', '<a href="/wiki/Byte"\n    enter code heretitle="Byte">', '</a>']

To strip out the tags, use re.sub:

>>> re.sub(r'<.+?>', '', text, flags=re.DOTALL)
'A hex triplet is a six-digit, three-byte ...'
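
The same substitution also works with the [^>] form, which needs no flag. Below is a small sketch wrapping it in a reusable helper; the names TAG_RE and strip_tags_regex are illustrative only. It also shows one limitation of any pattern this simple: a > inside a quoted attribute value ends the match early.

```python
import re

# Compile once; [^>] already matches newlines, so re.DOTALL is not needed.
TAG_RE = re.compile(r'<[^>]+>')


def strip_tags_regex(text):
    """Remove anything that looks like <...> from text."""
    return TAG_RE.sub('', text)


text = """A <b>hex triplet</b> is a six-digit, three-<a href="/wiki/Byte"
    title="Byte">byte</a> ..."""
print(strip_tags_regex(text))  # A hex triplet is a six-digit, three-byte ...

# Limitation: the first ">" ends the match, even inside an attribute value.
print(repr(strip_tags_regex('<img alt="a > b">')))  # ' b">'
```

For well-behaved fragments like the one in the question this is enough. For arbitrary pages, a real parser is the safer choice.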

Upvotes: 2
