Reputation: 5511
I have to process a large archive of extremely messy HTML full of extraneous tables, spans and inline styles into markdown.
I am trying to use Beautiful Soup to accomplish this task, and my goal is basically the output of the get_text() function, except to preserve anchor tags with the href intact.
As an example, I would like to convert:
<td>
<font><span>Hello</span><span>World</span></font><br>
<span>Foo Bar <span>Baz</span></span><br>
<span>Example Link: <a href="https://google.com" target="_blank" style="mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;color: #395c99;font-weight: normal;text-decoration: underline;">Google</a></span>
</td>
Into:
Hello World
Foo Bar Baz
Example Link: <a href="https://google.com">Google</a>
My thought process so far was to simply grab all the tags and unwrap them all if they aren't anchors, but this causes the text to be repeated several times, as soup.find_all(True) returns recursively nested tags as individual elements:
#!/usr/bin/env python
from bs4 import BeautifulSoup

example_html = '<td><font><span>Hello</span><span>World</span></font><br><span>Foo Bar <span>Baz</span></span><br><span>Example Link: <a href="https://google.com" target="_blank" style="mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;color: #395c99;font-weight: normal;text-decoration: underline;">Google</a></span></td>'
soup = BeautifulSoup(example_html, 'lxml')
tags = soup.find_all(True)
for tag in tags:
    if (tag.name == 'a'):
        print("<a href='{}'>{}</a>".format(tag['href'], tag.get_text()))
    else:
        print(tag.get_text())
Which returns multiple fragments/duplicates as the parser moves down the tree:
HelloWorldFoo Bar BazExample Link: Google
HelloWorldFoo Bar BazExample Link: Google
HelloWorldFoo Bar BazExample Link: Google
HelloWorld
Hello
World
Foo Bar Baz
Baz
Example Link: Google
<a href='https://google.com'>Google</a>
Upvotes: 7
Views: 3083
Reputation: 1
I found bleach perfect for my use-case:
data = """<!-- wp:html -->
<div class="post_information-attention icon--warn"><p>手続き途中に以下メッセージが表示された方は、
<a href="https://medium.aiplanet.com/advanced-rag-improving-retrieval-using-hypothetical-document-embeddings-hyde-1421a8ec075a" data-eventcategory="website" data-eventaction="cs_anchor_clicked"
data-eventlabel="er_02"><strong>こちら. </strong></a>をご確認ください。</p>
<ul style="list-style-type: disc;margin-bottom: 0px;">
<li>「処理に失敗しました」</li>
<li>「このアカウントのご利用は制限されています」</li>
</ul></div>
<!-- /wp:html -->
"""
import bleach
clean = bleach.clean(data, tags=['a'], strip=True)
print(clean)
Upvotes: 0
Reputation: 1844
The above examples will give you duplicate lines, as the text in <a>text</a> is visited and concatenated twice (first as part of the a tag and then as a string child of that a tag). Here's my take, without that issue, which also handles HTML comments and code tags:
from bs4 import BeautifulSoup, Comment, NavigableString


def html2text(html: str) -> str:
    soup = BeautifulSoup(html, features="html.parser")
    texts = []
    for element in soup.descendants:
        # Only text nodes are of interest; HTML comments are skipped.
        if isinstance(element, NavigableString) and not isinstance(element, Comment):
            # Replace the contents of <code> tags with a placeholder.
            if element.parent and element.parent.name == "code":
                texts.append("--")
                continue
            s = element.strip()
            if s:
                # Append the link target to text that sits directly inside an <a>.
                # Single quotes inside the f-string keep this valid before Python 3.12.
                texts.append(
                    f"{s}: {element.parent['href']}"
                    if element.parent and element.parent.name == "a" else
                    s
                )
    return "\n\n".join(texts)
For HTML like:
text11
text12
<p>
text21
text22
</p>
<div>
<div>text3</div>
</div>
<a href="https://url1.com">text4</a>
<img src="something" alt="text5"/>
<a href="https://url2.com"><img/> text6</a>
<!-- text7 -->
<code>console.log("Some text, should be excluded")</code>
It produces TEXT like:
text11
text12
text21
text22
text3
text4: https://url1.com
text6: https://url2.com
--
Image alt text, code, and HTML comments are intentionally omitted. If you need them, modify the function accordingly.
Upvotes: 0
Reputation: 453
In case someone wants to avoid overriding or decorating classes: a good enough approach, in my opinion, is to iterate through all the descendants (recursively) of a root element and append (for example) span elements, containing the link references, as children of the <a> links before doing whatever get_text() operations you need. So, using the OP's example:
from bs4 import BeautifulSoup, Tag

example_html = '<td><font><span>Hello</span><span>World</span></font><br><span>Foo Bar <span>Baz</span></span><br><span>Example Link: <a href="https://google.com" target="_blank" style="mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;color: #395c99;font-weight: normal;text-decoration: underline;">Google</a></span></td>'
soup = BeautifulSoup(example_html, 'html.parser')
# Iterate over a snapshot, because the loop body modifies the tree.
for el in list(soup.descendants):
    if isinstance(el, Tag):
        if el.name == 'a' and 'href' in el.attrs:
            new_span = soup.new_tag('span')
            new_span.string = '<a href="' + el.attrs['href'] + '">' + el.get_text() + '</a>'
            el.clear()  # if we want to "replace" and not just append
            el.append(new_span)  # same as inserting at position len(el.contents)
print(soup.get_text())  # HelloWorldFoo Bar BazExample Link: <a href="https://google.com">Google</a>
Notice that in real life you might find different kinds of links (href): #hello (anchors), javascript: (tricky stuff), /hello-world (relative URLs, i.e. without protocol and domain)... So you might want to do something about it.
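As a rough sketch of what "doing something about it" could look like, the helper below classifies an href with the standard library's urllib.parse before it is written into the output. The function name normalize_href and the base URL are hypothetical, and the policy (drop fragment-only links and every scheme other than http/https, resolve relative URLs against the page URL) is only an assumption about what a clean-up pass might want.

```python
from urllib.parse import urljoin, urlsplit


def normalize_href(href, base_url):
    """Return an absolute http(s) URL for href, or None if the link should be dropped.

    Hypothetical helper: the drop/keep policy here is an assumption, not a rule.
    """
    href = href.strip()
    # Fragment-only links such as "#hello" only make sense inside the original page.
    if not href or href.startswith("#"):
        return None
    scheme = urlsplit(href).scheme.lower()
    # In this sketch, anything with a scheme other than http/https
    # (javascript:, mailto:, data:, ...) is dropped.
    if scheme and scheme not in ("http", "https"):
        return None
    # Relative URLs such as "/hello-world" are resolved against the page URL.
    return urljoin(base_url, href)


if __name__ == "__main__":
    base = "https://example.com/archive/page.html"  # assumed page URL
    for candidate in ("#hello", "javascript:void(0)", "/hello-world", "https://google.com"):
        print(repr(candidate), "->", normalize_href(candidate, base))
```

Links for which the helper returns None could be reduced to their plain text instead of being emitted as anchors.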
Upvotes: 0
Reputation: 11
The accepted solution did not work for me (I had the same issue as @alextre, probably due to a version change). However, I managed to resolve it by modifying it to override the get_text() method instead of _all_strings().
from bs4 import BeautifulSoup, NavigableString, Tag


class MyBeautifulSoup(BeautifulSoup):
    def get_text(self, separator='', strip=False, types=(NavigableString,)):
        text_parts = []
        for element in self.descendants:
            if isinstance(element, Tag):
                if element.name == 'a' and 'href' in element.attrs:
                    # Emit the link text followed by its target in parentheses.
                    link_text = element.get_text(separator=separator, strip=strip)
                    text_parts.append(link_text + '(' + element['href'] + ')')
            elif isinstance(element, types):
                # Text inside a link was already emitted together with the link.
                if element.find_parent('a', href=True) is not None:
                    continue
                text = element.strip() if strip else str(element)
                if text:
                    text_parts.append(text)
        return separator.join(text_parts)
Upvotes: 1
Reputation: 421
To consider only direct children, set recursive=False. You then need to process each 'td' and extract the text and the anchor link individually.
#!/usr/bin/env python
from bs4 import BeautifulSoup

example_html = '<td><font><span>Some Example Text</span></font><br><span>Another Example Text</span><br><span>Example Link: <a href="https://google.com" target="_blank" style="mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;color: #395c99;font-weight: normal;text-decoration: underline;">Google</a></span></td>'
soup = BeautifulSoup(example_html, 'lxml')
tags = soup.find_all(recursive=False)
for tag in tags:
    print(tag.text)
    print(tag.find('a'))
If you want the text printed on separate lines you will have to process the spans individually.
for tag in tags:
    spans = tag.find_all('span')
    for span in spans:
        print(span.text)
    print(tag.find('a'))
Upvotes: 2
Reputation: 473813
One possible way to tackle this problem is to introduce some special handling for a elements when it comes to printing out the text of an element.
You can do it by overriding the _all_strings() method so that it returns a string representation of each a descendant element and skips the navigable string inside that a element. Something along these lines:
from bs4 import BeautifulSoup, NavigableString, CData, Tag


class MyBeautifulSoup(BeautifulSoup):
    def _all_strings(self, strip=False, types=(NavigableString, CData)):
        for descendant in self.descendants:
            # return "a" string representation if we encounter it
            if isinstance(descendant, Tag) and descendant.name == 'a':
                yield str(descendant)

            # skip an inner text node inside "a"
            if isinstance(descendant, NavigableString) and descendant.parent.name == 'a':
                continue

            # default behavior
            if (
                (types is None and not isinstance(descendant, NavigableString))
                or
                (types is not None and type(descendant) not in types)):
                continue

            if strip:
                descendant = descendant.strip()
                if len(descendant) == 0:
                    continue
            yield descendant
Demo:
In [1]: data = """
...: <td>
...: <font><span>Hello</span><span>World</span></font><br>
...: <span>Foo Bar <span>Baz</span></span><br>
...: <span>Example Link: <a href="https://google.com" target="_blank" style="mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;color: #395c99;font-weight: normal;text-decoration: underline;">Google</a></span>
...: </td>
...: """
In [2]: soup = MyBeautifulSoup(data, "lxml")
In [3]: print(soup.get_text())
HelloWorld
Foo Bar Baz
Example Link: <a href="https://google.com" style="mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;color: #395c99;font-weight: normal;text-decoration: underline;" target="_blank">Google</a>
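Note that the demo output still carries the target and style attributes, while the question asks for anchors that keep only the href. As a rough sketch of one way to get closer to that, the following uses the standard library's html.parser instead of Beautiful Soup (a swapped-in technique, not the approach of this answer). The names AnchorPreservingText and text_with_links are hypothetical, and turning <br> into a newline is an assumption about the desired formatting. Like get_text(), it does not insert a space between adjacent spans.

```python
from html import escape
from html.parser import HTMLParser


class AnchorPreservingText(HTMLParser):
    """Collect text, keep <a> tags reduced to their href, turn <br> into newlines."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        # One flag per open <a>: True if an opening tag was emitted for it.
        self.open_links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                # Re-emit the anchor with the href as its only attribute.
                self.parts.append('<a href="{}">'.format(escape(href, quote=True)))
            self.open_links.append(href is not None)
        elif tag == "br":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        # Close only the anchors for which an opening tag was emitted.
        if tag == "a" and self.open_links and self.open_links.pop():
            self.parts.append("</a>")

    def handle_data(self, data):
        self.parts.append(data)


def text_with_links(html):
    parser = AnchorPreservingText()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


if __name__ == "__main__":
    sample = ('<td><font><span>Hello</span><span>World</span></font><br>'
              '<span>Example Link: <a href="https://google.com" target="_blank">Google</a></span></td>')
    print(text_with_links(sample))
```

The standard-library parser is less forgiving of badly broken markup than lxml, so for a very messy archive the same idea may be easier to apply on top of a Beautiful Soup tree.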
Upvotes: 7