Reputation: 1469
There are many answers on how to convert HTML to text using BeautifulSoup (for example, https://stackoverflow.com/a/24618186/3946214).
There are also many answers on how to extract links from HTML using BeautifulSoup.
What I need is a way to turn HTML into a text-only version while keeping each link inline, next to the text it belongs to. For example, if I had some HTML that looked like this:
<div>Click <a href="www.google.com">Here</a> to receive a quote</div>
It would be nice to convert this to "Click Here (www.google.com) to receive a quote."
The use case is that I need to convert HTML emails into a text-only version, and it would be nice to have the links where they are semantically located in the HTML, instead of at the bottom. This exact syntax isn't required. I'd appreciate any guidance on how to do this. Thank you!
Upvotes: 0
Views: 1202
Reputation: 195408
If you want a BeautifulSoup solution, you can start with this example (it will probably need more tuning for real-world data):
from bs4 import BeautifulSoup

data = '<div>Click <a href="www.google.com">Here</a> to receive a quote.</div>'
soup = BeautifulSoup(data, 'html.parser')

# append the href, in parentheses, to the text of each link
for a in soup.select('a[href]'):
    a.append(' ({})'.format(a['href']))

# unwrap() all tags so that only their text content remains
for tag in soup.select('*'):
    tag.unwrap()

print(soup)
Prints:
Click Here (www.google.com) to receive a quote.
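If you would rather avoid a third-party dependency, the same idea can be sketched with the standard library's html.parser module. This is only a minimal sketch: the class name LinkPreservingTextExtractor and the helper html_to_text_with_links are made up for illustration, and the code does not skip script or style content or normalize whitespace, so real email HTML will likely need more handling.

```python
from html.parser import HTMLParser


class LinkPreservingTextExtractor(HTMLParser):
    """Hypothetical helper: collects text and appends ' (href)' after each link."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []      # pieces of output text, in document order
        self._href_stack = []  # href of each currently open <a> (None if missing)

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # Remember the href so it can be emitted when the link closes.
            self._href_stack.append(dict(attrs).get("href"))

    def handle_endtag(self, tag):
        if tag == "a" and self._href_stack:
            href = self._href_stack.pop()
            if href:
                self._chunks.append(" ({})".format(href))

    def handle_data(self, data):
        self._chunks.append(data)

    def get_text(self):
        return "".join(self._chunks)


def html_to_text_with_links(html):
    """Return the text of `html` with link targets kept inline."""
    parser = LinkPreservingTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.get_text()


data = '<div>Click <a href="www.google.com">Here</a> to receive a quote.</div>'
print(html_to_text_with_links(data))
# prints: Click Here (www.google.com) to receive a quote.
```

Anchors without an href are left as plain text, since there is no target to show.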
Upvotes: 1
Reputation: 11505
import html2text
data = """
<div>Click <a href="www.google.com">Here</a> to receive a quote</div>
"""
print(html2text.html2text(data))
Output:
Click [Here](www.google.com) to receive a quote
Upvotes: 1