Reputation: 13515
I am getting the first paragraph from pages and trying to extract words suitable to be tags or keywords. In some paragraphs there are links and I want to remove the tags:
For instance if the text is
A <b>hex triplet</b> is a six-digit, three-<a href="/wiki/Byte"
enter code heretitle="Byte">byte</a> ...
I want to remove
<b></b><a href="/wiki/Byte" title="Byte"></a>
to end up with this
A hex triplet is a six-digit, three-byte ...
A regex like this does not work:
>>> import re
>>> text = """A <b>hex triplet</b> is a six-digit, three-<a href="/wiki/Byte"
... enter code heretitle="Byte">byte</a> ..."""
>>> f = re.findall(r'<.+>', text)
>>> f
['<b>hex triplet</b>', '</a>']
>>>
What is the best way to do this?
I found several similar questions, but I don't think any of them solves this particular problem.
Update with an example of BeautifulSoup's extract (extract deletes the tag including its text, and must be run for each tag separately):
>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup(text)
>>> [s.extract() for s in soup('b')]
[<b>hex triplet</b>]
>>> soup
A is a six-digit, three-<a href="/wiki/Byte" enter code heretitle="Byte">byte</a> ...
>>> [s.extract() for s in soup('a')]
[<a href="/wiki/Byte" enter code heretitle="Byte">byte</a>]
>>> soup
A is a six-digit, three- ...
>>>
Update
For people with the same question: as mentioned by Brendan Long, this answer using HtmlParser works best.
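For reference, an HTMLParser-based tag stripper might look roughly like the sketch below. It assumes Python 3's standard-library html.parser module; the names TagStripper and strip_tags are illustrative and are not taken from the answer mentioned above.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collects only the text between tags and ignores the tags themselves."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for each run of text found outside of a tag.
        self.parts.append(data)

    def get_text(self):
        return "".join(self.parts)


def strip_tags(html):
    """Return html with all tags removed (hypothetical helper name)."""
    stripper = TagStripper()
    stripper.feed(html)
    stripper.close()  # flush any text still buffered by the parser
    return stripper.get_text()


if __name__ == "__main__":
    sample = ('A <b>hex triplet</b> is a six-digit, three-<a href="/wiki/Byte"\n'
              'title="Byte">byte</a> ...')
    print(strip_tags(sample))
```

Unlike extract, this keeps the text inside each tag and handles every tag name in a single pass.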
Upvotes: 2
Views: 274
Reputation: 20229
Beautiful Soup is the answer to your problem! Try it out, it's pretty awesome!
HTML parsing becomes much easier once you use it.
>>> text = """A <b>hex triplet</b> is a six-digit, three-<a href="/wiki/Byte"
... enter code heretitle="Byte">byte</a> ..."""
>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup(text)
>>> ''.join(soup.findAll(text=True))
u'A hex triplet is a six-digit, three-byte ...'
If all the text you want to extract is enclosed in some outer tag, such as <body> ... </body> or a <div id="X"> ... </div>, you can extract text selectively from only that tag. The illustration below assumes that all the text you want is inside the <body> tag. (Look at the documentation and examples and you will find many other ways of parsing the DOM.)
>>> text = """<body>A <b>hex triplet</b> is a six-digit,
... three-<a href="/wiki/Byte"
... enter code heretitle="Byte">byte</a>
... </body>"""
>>> soup = BeautifulSoup(text)
>>> ''.join(soup.body.findAll(text=True))
u'A hex triplet is a six-digit, three-byte'
Upvotes: 3
Reputation:
These are just the basic elements needed to strip tags. Some elements are missing:
the \w's below stand in for qualified Unicode tag names with prefix and body,
which would need a join() statement to form the subexpression. The virtue of parsing
HTML/XML with a regex is that it won't fail on the first ill-formed instance, which
makes it perfect for fixing it! The vice is that it's slow as sh*t, especially
with Unicode.
Unfortunately, stripping tags destroys content, since by definition markup formats content.
Try this on a big web page. This should be translatable into Python.
$rx_expanded = '
<
(?:
(?:
(?:
(?:script|style) \s*
| (?:script|style) \s+ (?:".*?"|\'.*?\'|[^>]*?)+\s*
)> .*? </(?:script|style)\s*
)
|
(?:
/?\w+\s*/?
| \w+\s+ (?:".*?"|\'.*?\'|[^>]*?)+\s*/?
| !(?:DOCTYPE.*?|--.*?--)
)
)
>
';
$html =~ s/$rx_expanded/[was]/xsg;
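A Python translation of the Perl snippet above could look like the following sketch. It assumes that Perl's /x and /s modifiers map to re.VERBOSE and re.DOTALL; the names TAG_RX and strip_markup are made up for illustration. As with the original, the nested quantifiers can backtrack heavily on malformed input, so treat it as a starting point only.

```python
import re

# Same pattern as the Perl version; VERBOSE ignores the layout whitespace
# and DOTALL lets "." cross newlines (Perl's /x and /s).
TAG_RX = re.compile(r"""
<
(?:
    (?:
        (?:
            (?:script|style) \s*
          | (?:script|style) \s+ (?:".*?"|'.*?'|[^>]*?)+ \s*
        )
        > .*? </(?:script|style) \s*
    )
  |
    (?:
        /?\w+\s*/?
      | \w+\s+ (?:".*?"|'.*?'|[^>]*?)+ \s*/?
      | !(?:DOCTYPE.*?|--.*?--)
    )
)
>
""", re.VERBOSE | re.DOTALL)


def strip_markup(html, replacement=""):
    """Replace every tag, comment, and script/style block with replacement."""
    return TAG_RX.sub(replacement, html)


if __name__ == "__main__":
    page = ('<p>Hello <b>world</b></p>'
            '<script type="text/javascript">var x = "<b>";</script>!')
    print(strip_markup(page))
    print(strip_markup(page, "[was]"))  # marker used in the Perl version
```

Passing "[was]" as the replacement reproduces the marker from the Perl substitution, which helps to see what was removed.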
Upvotes: 0
Reputation: 361547
The + quantifier is greedy, meaning it will find the longest possible match. Add a ? to force it to find the shortest possible match:
>>> re.findall(r'<.+?>', text)
['<b>', '</b>', '</a>']
Another way to write the regex is to explicitly exclude right angle brackets inside a tag, using [^>] instead of the dot (.):
>>> re.findall(r'<[^>]+>', text)
['<b>', '</b>', '<a href="/wiki/Byte"\n enter code heretitle="Byte">', '</a>']
An advantage of this approach is that it will also match newlines (\n). You can get the same behavior with the dot (.) if you add the re.DOTALL flag.
>>> re.findall(r'<.+?>', text, re.DOTALL)
['<b>', '</b>', '<a href="/wiki/Byte"\n enter code heretitle="Byte">', '</a>']
To strip out the tags, use re.sub
:
>>> re.sub(r'<.+?>', '', text, flags=re.DOTALL)
'A hex triplet is a six-digit, three-byte ...'
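Putting the pieces together, a small reusable helper might look like the sketch below. It uses the negated character class, so no re.DOTALL flag is needed; the names TAG_PATTERN and remove_tags are made up for illustration. It assumes reasonably well-formed markup and would misfire on a ">" inside an attribute value.

```python
import re

# "<", then one or more characters that are not ">", then ">".
# [^>] also matches newlines, so tags split across lines are handled.
TAG_PATTERN = re.compile(r'<[^>]+>')


def remove_tags(text):
    """Return text with every <...> tag removed (hypothetical helper name)."""
    return TAG_PATTERN.sub('', text)


if __name__ == "__main__":
    text = ('A <b>hex triplet</b> is a six-digit, three-<a href="/wiki/Byte"\n'
            'title="Byte">byte</a> ...')
    print(remove_tags(text))
```

Compiling the pattern once is only a convenience here; calling re.sub directly with the same pattern behaves the same way.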
Upvotes: 2