Arya

Reputation: 1469

Extract Links from HTML In Line with Text with Python/BeautifulSoup

There are many answers on how to convert HTML to text using BeautifulSoup (for example, https://stackoverflow.com/a/24618186/3946214).

There are also many answers on how to extract links from HTML using BeautifulSoup.

What I need is a way to turn HTML into a text-only version while keeping each link inline, next to the text it belongs to. For example, if I had some HTML that looked like this:

<div>Click <a href="www.google.com">Here</a> to receive a quote</div>

It would be nice to convert this to "Click Here (www.google.com) to receive a quote."

The use case here is that I need to convert HTML emails into a text-only version, and it would be nice to have the links where they are semantically located in the HTML, instead of at the bottom. This exact syntax isn't required. I'd appreciate any guidance on how to do this. Thank you!

Upvotes: 0

Views: 1202

Answers (2)

Andrej Kesely

Reputation: 195408

If you want a BeautifulSoup solution, you can start with this example (it will probably need more tuning for real-world data):

from bs4 import BeautifulSoup

data = '<div>Click <a href="www.google.com">Here</a> to receive a quote.</div>'

soup = BeautifulSoup(data, 'html.parser')

# append the href, in parentheses, to the end of each link's text
for a in soup.select('a[href]'):
    a.append(soup.new_string(' ({})'.format(a['href'])))

# unwrap() every tag so that only the text remains
for tag in soup.select('*'):
    tag.unwrap()

print(soup)

Prints:

Click Here (www.google.com) to receive a quote.
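
If you would rather avoid the extra dependency, the same idea can be sketched with the standard library's html.parser instead of BeautifulSoup. This is only a minimal sketch, not a drop-in replacement: the names InlineLinkText and html_to_text_with_links are made up for illustration, and it assumes the input has no <script> or <style> content that should be skipped.

```python
from html.parser import HTMLParser


class InlineLinkText(HTMLParser):
    """Collect text and append ' (href)' after each link's text.

    Hypothetical helper class, not part of any library.
    """

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []       # text fragments in document order
        self.href_stack = []  # hrefs of the <a> tags that are currently open

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) pairs; href may be missing
            self.href_stack.append(dict(attrs).get("href"))

    def handle_endtag(self, tag):
        if tag == "a" and self.href_stack:
            href = self.href_stack.pop()
            if href:
                self.parts.append(" ({})".format(href))

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text_with_links(html):
    """Return the text of `html` with each link's href inlined after it."""
    parser = InlineLinkText()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


data = '<div>Click <a href="www.google.com">Here</a> to receive a quote.</div>'
print(html_to_text_with_links(data))
```

For the sample input this prints the same line as the BeautifulSoup version. Whitespace handling and block-level tags (paragraphs, line breaks) are left untouched here, so real emails would likely need extra rules.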

Upvotes: 1

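# html2text is a third-party package: pip install html2text
import html2text

data = """
<div>Click <a href="www.google.com">Here</a> to receive a quote</div>
"""

# html2text() converts HTML to Markdown, so links stay inline as [text](url)
print(html2text.html2text(data))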

Output:

Click [Here](www.google.com) to receive a quote
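
If the Markdown link syntax is not wanted, one option is to post-process the string with a regular expression so it matches the "Here (www.google.com)" form from the question. This is a rough sketch using only the standard library: markdown_links_to_plain is a hypothetical helper name, and the pattern assumes link text without nested brackets, URLs without spaces or parentheses, and links that are not split across wrapped lines. Markdown image syntax is not handled.

```python
import re

# Matches Markdown inline links such as [Here](www.google.com).
# Assumption: no nested brackets in the text, no spaces or ")" in the URL.
MD_LINK = re.compile(r"\[([^\]]+)\]\(([^)\s]+)\)")


def markdown_links_to_plain(text):
    """Rewrite '[text](url)' as 'text (url)' (hypothetical helper)."""
    return MD_LINK.sub(r"\1 (\2)", text)


md = "Click [Here](www.google.com) to receive a quote"
print(markdown_links_to_plain(md))
```

The string md above is the html2text output shown earlier, hard-coded so the snippet runs without html2text installed.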

Upvotes: 1
