Reputation: 45
I'm using Beautiful Soup in Python to turn some fairly junky HTML into plain text while preserving some of the HTML formatting, specifically the line breaks.
Here's an example:
from bs4 import BeautifulSoup
html_input = '''
<body>
<p>Full
Name:
John Doe</p>
Phone: 01234123123<br />
Note: This
is a
test message<br>
It should be ignored.
</body>
'''
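# Name the parser explicitly so the result does not depend on which parsers are installed
message_body_plain = BeautifulSoup(html_input.replace('\n', '').replace('\r', ''), 'html.parser')
print(message_body_plain.get_text())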
Sometimes the HTML I get has newlines where spaces should be (see "Full Name" above), and sometimes it doesn't. I've tried stripping out all the newlines and replacing the HTML line breaks with newline literals, but that breaks whenever I come across a line break written in a form I hadn't anticipated. Surely there's a parser that does this for me?
Here's my preferred output:
Full Name: John Doe
Phone: 01234123123
Note: This is a test message
It should be ignored.
Note how the only newlines are from the HTML tags. Does anyone know the best way to achieve what I want?
Upvotes: 1
Views: 6735
Reputation: 194
Staying within Beautiful Soup, you can also try replacing each block-level or line-break tag with its text followed by newlines:
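from bs4 import BeautifulSoup
soup = BeautifulSoup(html_input, "html.parser")
# Swap each listed tag for its text plus a blank line; <br> has no text, so it becomes just the newlines
for elem in soup.find_all(["a", "p", "div", "h3", "br"]):
    elem.replace_with(elem.text + "\n\n")
print(soup.get_text())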
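If you want the exact output from the question (single newlines, and source newlines treated as spaces), here is a rough sketch of one way to do it. It's not the only approach: the function name html_to_text and the BLOCK_TAGS list are my own choices for illustration, so adjust the list to whatever tags your HTML actually uses. The idea is to collapse all whitespace inside the text nodes first, so that the only newlines left are the ones inserted for tags.

```python
import re
from bs4 import BeautifulSoup, NavigableString

# Assumption: tags that should end a line. Extend or trim for your own input.
BLOCK_TAGS = ["p", "div", "h1", "h2", "h3", "li", "tr"]

def html_to_text(html):
    soup = BeautifulSoup(html, "html.parser")
    # Collapse every run of source whitespace (including newlines) to one space.
    # Only plain text nodes are touched, so comments stay out of the output.
    for node in soup.find_all(string=True):
        if type(node) is NavigableString:
            node.replace_with(re.sub(r"\s+", " ", node))
    # From here on, the only newlines are the ones added for tags.
    for br in soup.find_all("br"):
        br.replace_with("\n")
    for block in soup.find_all(BLOCK_TAGS):
        block.append("\n")
    # Trim stray spaces around each line and drop empty lines.
    lines = (line.strip() for line in soup.get_text().split("\n"))
    return "\n".join(line for line in lines if line)

sample = """
<body>
<p>Full
Name:
John Doe</p>
Phone: 01234123123<br />
Note: This
is a
test message<br>
It should be ignored.
</body>
"""
print(html_to_text(sample))
```

Because <br> is matched as a tag rather than as a string, it doesn't matter whether the source writes it as <br>, <br/> or <br />. Note that dropping empty lines also removes deliberate blank lines (two <br> in a row), so take out that filter if you need them.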
Upvotes: 2