waffl

Reputation: 5511

Beautiful Soup - Get all text, but preserve link html?

I have to process a large archive of extremely messy HTML full of extraneous tables, spans and inline styles into markdown.

I am trying to use Beautiful Soup to accomplish this task, and my goal is basically the output of the get_text() function, except to preserve anchor tags with the href intact.

As an example, I would like to convert:

<td>
    <font><span>Hello</span><span>World</span></font><br>
    <span>Foo Bar <span>Baz</span></span><br>
    <span>Example Link: <a href="https://google.com" target="_blank" style="mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;color: #395c99;font-weight: normal;text-decoration: underline;">Google</a></span>
</td>

Into:

Hello World
Foo Bar Baz
Example Link: <a href="https://google.com">Google</a>

My thought process so far was to simply grab all the tags and unwrap them all if they aren't anchors, but this causes the text to be repeated several times as soup.find_all(True) returns recursively nested tags as individual elements:

#!/usr/bin/env python

from bs4 import BeautifulSoup

example_html = '<td><font><span>Hello</span><span>World</span></font><br><span>Foo Bar <span>Baz</span></span><br><span>Example Link: <a href="https://google.com" target="_blank" style="mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;color: #395c99;font-weight: normal;text-decoration: underline;">Google</a></span></td>'

soup = BeautifulSoup(example_html, 'lxml')
tags = soup.find_all(True)

for tag in tags:
    if tag.name == 'a':
        print("<a href='{}'>{}</a>".format(tag['href'], tag.get_text()))
    else:
        print(tag.get_text())

Which returns multiple fragments/duplicates as the parser moves down the tree:

HelloWorldFoo Bar BazExample Link: Google
HelloWorldFoo Bar BazExample Link: Google
HelloWorldFoo Bar BazExample Link: Google
HelloWorld
Hello
World

Foo Bar Baz
Baz

Example Link: Google
<a href='https://google.com'>Google</a>

Upvotes: 7

Views: 3083

Answers (6)

Satin Kriplani

Reputation: 1

I found bleach perfect for my use-case:

data = """<!-- wp:html -->
    <div class="post_information-attention icon--warn"><p>手続き途中に以下メッセージが表示された方は、
    <a href="https://medium.aiplanet.com/advanced-rag-improving-retrieval-using-hypothetical-document-embeddings-hyde-1421a8ec075a" data-eventcategory="website" data-eventaction="cs_anchor_clicked"
    data-eventlabel="er_02"><strong>こちら.       </strong></a>をご確認ください。</p>
    <ul style="list-style-type: disc;margin-bottom: 0px;">
    <li>「処理に失敗しました」</li>
    <li>「このアカウントのご利用は制限されています」</li>
</ul></div>
<!-- /wp:html -->
"""

import bleach
clean = bleach.clean(data, tags=['a'], strip=True)
print(clean)

Upvotes: 0

Ivan Kleshnin

Reputation: 1844

The above examples will give you duplicate lines, because the text in a > text is visited and concatenated twice (first as part of a and then as a string child of a). Here's my take, which avoids that issue and also handles HTML comments and code tags:

from bs4 import BeautifulSoup, Comment, NavigableString

def html2text(html: str) -> str:
  soup = BeautifulSoup(html, features="html.parser")
  texts = []
  for element in soup.descendants:
    if isinstance(element, NavigableString) and not isinstance(element, Comment):
      if element.parent and element.parent.name == "code":
        texts.append("--")
        continue
      s = element.strip()
      if s:
        texts.append(
          # single quotes inside the f-string keep this valid before Python 3.12
          f"{s}: {element.parent['href']}"
          if element.parent and element.parent.name == "a" else
          s
        )
  return "\n\n".join(texts)

For HTML like:

text11
text12
<p>
  text21
  text22
</p>
<div>
  <div>text3</div>
</div>
<a href="https://url1.com">text4</a>
<img src="something" alt="text5"/>

<a href="https://url2.com"><img/> text6</a>
<!-- text7 -->

<code>console.log("Some text, should be excluded")</code>

It produces TEXT like:

text11
text12

text21
text22

text3

text4: https://url1.com

text6: https://url2.com

-- 

Image alt text, code contents, and HTML comments are intentionally omitted. If you need them, modify accordingly.

Upvotes: 0

Mitxel

Reputation: 453

In case someone wants to avoid overriding or decorating classes: a good enough approach, in my opinion, is to iterate through all the descendants (recursively) of a root element and append (for example) span elements containing the link references as children of the <a> links, before doing whatever get_text() operations you need. So, using the OP's example:

from bs4 import BeautifulSoup, Tag

example_html = '<td><font><span>Hello</span><span>World</span></font><br><span>Foo Bar <span>Baz</span></span><br><span>Example Link: <a href="https://google.com" target="_blank" style="mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;color: #395c99;font-weight: normal;text-decoration: underline;">Google</a></span></td>'

soup = BeautifulSoup(example_html, 'html.parser')

for el in soup.descendants:
    if isinstance(el, Tag):
        if el.name == 'a' and 'href' in el.attrs:
            new_span = soup.new_tag('span')
            new_span.string = '<a href="' + el.attrs['href'] + '">' + el.get_text() + '</a>'
            el.clear()  # if we want to "replace" and not just append
            el.append(new_span)  # add the new span as the last child of the link

print(soup.get_text())  # HelloWorldFoo Bar BazExample Link: <a href="https://google.com">Google</a>

Note that in real life you might find different kinds of links (href): #hello (anchors), javascript: (tricky stuff), /hello-world (relative URLs, i.e. without protocol and domain)... So you might want to handle those cases.
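
As a rough sketch of handling those link kinds, the standard library's urllib.parse can classify and resolve an href. The helper name normalize_href, the base URL, and the decision to drop everything that is not http(s) are assumptions made for illustration, not part of the answer above:

```python
from urllib.parse import urljoin, urlsplit


def normalize_href(href, base_url):
    """Return an absolute URL for `href`, or None if it is not a navigable link."""
    href = href.strip()
    if not href or href.startswith("#"):
        return None  # empty href or same-page anchor
    scheme = urlsplit(href).scheme.lower()
    if scheme and scheme not in ("http", "https"):
        return None  # javascript:, mailto:, data:, and similar schemes
    return urljoin(base_url, href)  # resolve relative URLs against the base


base = "https://example.com/docs/"  # hypothetical base URL of the page
print(normalize_href("#hello", base))
print(normalize_href("javascript:void(0)", base))
print(normalize_href("/hello-world", base))
print(normalize_href("https://google.com", base))
```

Whether to drop or keep mailto: and fragment links depends on what the markdown output is for, so treat the filtering rules as a starting point.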

Upvotes: 0

nissimb

Reputation: 11

The accepted solution is not working for me (I had the same issue as @alextre, probably due to a version change). However, I managed to resolve it by making modifications and overriding the get_text() method instead of _all_strings().

from bs4 import BeautifulSoup, NavigableString, Tag


class MyBeautifulSoup(BeautifulSoup):
    def get_text(self, separator='', strip=False, types=(NavigableString,)):
        text_parts = []

        for element in self.descendants:
            if isinstance(element, Tag):
                if element.name == 'a' and 'href' in element.attrs:
                    # emit the link text followed by its URL in parentheses
                    text_parts.append(element.get_text(separator=separator, strip=strip))
                    text_parts.append('(' + element['href'] + ')')
            elif isinstance(element, types):
                # skip strings inside a link: they were already emitted above
                if element.find_parent('a', href=True) is not None:
                    continue
                text = element.strip() if strip else str(element)
                if text:
                    text_parts.append(text)

        return separator.join(text_parts)

Upvotes: 1

Scot

Reputation: 421

To consider only direct children, set recursive=False. You then need to process each 'td' and extract the text and anchor link individually.

#!/usr/bin/env python
from bs4 import BeautifulSoup

example_html = '<td><font><span>Some Example Text</span></font><br><span>Another Example Text</span><br><span>Example Link: <a href="https://google.com" target="_blank" style="mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;color: #395c99;font-weight: normal;text-decoration: underline;">Google</a></span></td>'

soup = BeautifulSoup(example_html, 'lxml')
tags = soup.find_all(recursive=False)
for tag in tags:
    print(tag.text)
    print(tag.find('a'))

If you want the text printed on separate lines you will have to process the spans individually.

for tag in tags:
    spans = tag.find_all('span')
    for span in spans:
        print(span.text)
    print(tag.find('a'))

Upvotes: 2

alecxe

Reputation: 473813

One possible way to tackle this problem is to introduce special handling for a elements when printing out the text of an element.

You can do this by overriding the _all_strings() method so that it yields the string representation of each a descendant and skips the navigable strings inside it. Something along these lines:

from bs4 import BeautifulSoup, NavigableString, CData, Tag


class MyBeautifulSoup(BeautifulSoup):
    def _all_strings(self, strip=False, types=(NavigableString, CData)):
        for descendant in self.descendants:
            # return "a" string representation if we encounter it
            if isinstance(descendant, Tag) and descendant.name == 'a':
                yield str(descendant)

            # skip an inner text node inside "a"
            if isinstance(descendant, NavigableString) and descendant.parent.name == 'a':
                continue

            # default behavior
            if (
                (types is None and not isinstance(descendant, NavigableString))
                or
                (types is not None and type(descendant) not in types)):
                continue

            if strip:
                descendant = descendant.strip()
                if len(descendant) == 0:
                    continue
            yield descendant

Demo:

In [1]: data = """
   ...: <td>
   ...:     <font><span>Hello</span><span>World</span></font><br>
   ...:     <span>Foo Bar <span>Baz</span></span><br>
   ...:     <span>Example Link: <a href="https://google.com" target="_blank" style="mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;color: #395c99;font-weight: normal;text-decoration: underline;">Google</a></span>
   ...: </td>
   ...: """

In [2]: soup = MyBeautifulSoup(data, "lxml")

In [3]: print(soup.get_text())

HelloWorld
Foo Bar Baz
Example Link: <a href="https://google.com" style="mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;color: #395c99;font-weight: normal;text-decoration: underline;" target="_blank">Google</a>
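
If overriding a private method is a concern, the same idea can be sketched with only the public tree-walking API. This is an illustrative variant, not the code above: the function name text_with_links is made up, it assumes the html.parser backend, it assumes each br should become a newline, and it rebuilds every anchor with its href only, so style and target are dropped:

```python
from html import escape

from bs4 import BeautifulSoup, Comment, NavigableString, Tag


def text_with_links(node):
    """Return the text under `node`, keeping <a> tags with only their href."""
    parts = []
    for child in node.children:
        if isinstance(child, Comment):
            continue  # drop HTML comments
        if isinstance(child, NavigableString):
            parts.append(str(child))
        elif isinstance(child, Tag):
            if child.name == "br":
                parts.append("\n")  # assumption: a line break becomes a newline
            elif child.name == "a" and child.has_attr("href"):
                # rebuild the anchor with href only; other attributes are dropped
                href = escape(child["href"], quote=True)
                parts.append('<a href="{}">{}</a>'.format(href, child.get_text()))
            else:
                parts.append(text_with_links(child))  # recurse into any other tag
    return "".join(parts)


example_html = (
    '<td><font><span>Hello</span><span>World</span></font><br>'
    '<span>Foo Bar <span>Baz</span></span><br>'
    '<span>Example Link: <a href="https://google.com" target="_blank" '
    'style="color: #395c99;">Google</a></span></td>'
)

soup = BeautifulSoup(example_html, "html.parser")
result = text_with_links(soup)
print(result)
# HelloWorld
# Foo Bar Baz
# Example Link: <a href="https://google.com">Google</a>
```

As in the demo above, adjacent spans are joined without a space ("HelloWorld"), because the source markup has no whitespace between them.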

Upvotes: 7
