davis

Reputation: 45

Converting HTML to plain text while preserving line breaks

I'm using Beautiful Soup in Python to turn some fairly messy HTML into plain text while preserving some of the HTML formatting, specifically the line breaks.

Here's an example:

from bs4 import BeautifulSoup

html_input = '''
<body>
<p>Full
Name:
John Doe</p>
Phone: 01234123123<br />
Note: This
is a 
test message<br>
It should be ignored.
</body>
'''

# Attempt: strip the source newlines before parsing, then extract the text
message_body_plain = BeautifulSoup(html_input.replace('\n', '').replace('\r', ''), 'html.parser')
print(message_body_plain.get_text())

Sometimes the HTML I've got has newlines instead of spaces (see "Full Name" above), and sometimes it doesn't. I've tried taking out all the newlines and also replacing the HTML line breaks with newline literals, but that breaks when I come across an HTML line break written in a way I hadn't considered. Surely there's a parser that does this for me?

Here's my preferred output:

Full Name: John Doe
Phone: 01234123123
Note: This is a test message
It should be ignored.

Note how the only newlines are from the HTML tags. Does anyone know the best way to achieve what I want?

Upvotes: 1

Views: 6735

Answers (1)

Gianluca Tarasconi

Reputation: 194

Staying within Beautiful Soup, you can also try:

soup = BeautifulSoup(html_input, "html.parser")

# Replace each listed element with its text plus a blank line
for elem in soup.find_all(["a", "p", "div", "h3", "br"]):
    elem.replace_with(elem.text + "\n\n")

print(soup.get_text())
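
If you need exactly the output from the question (one newline per HTML break, source newlines treated as spaces), a variation on the same idea might look like the sketch below. It assumes that collapsing all source whitespace is acceptable (so it would mangle the contents of a pre block) and that p, div and h3 are the only block-level tags that matter for your input. html_to_text is a made-up helper name, not part of Beautiful Soup.

```python
import re

from bs4 import BeautifulSoup


def html_to_text(html):
    """Hypothetical helper: plain text whose only newlines come from HTML tags."""
    # Treat every run of source whitespace (including newlines) as one space,
    # roughly the way a browser renders normal flow text.
    collapsed = re.sub(r"\s+", " ", html)
    soup = BeautifulSoup(collapsed, "html.parser")

    # An explicit line break becomes a newline character.
    for br in soup.find_all("br"):
        br.replace_with("\n")

    # Block-level elements end their line; this tag list is an assumption.
    for block in soup.find_all(["p", "div", "h3"]):
        block.append("\n")

    # Trim each line and drop the empty ones left by nested or adjacent blocks.
    lines = (line.strip() for line in soup.get_text().split("\n"))
    return "\n".join(line for line in lines if line)


sample = """
<body>
<p>Full
Name:
John Doe</p>
Phone: 01234123123<br />
Note: This
is a
test message<br>
It should be ignored.
</body>
"""

print(html_to_text(sample))
```

Because the whitespace is collapsed before parsing, it does not matter whether the source uses newlines or spaces inside a paragraph, and both the br and br / spellings are handled by the parser rather than by string replacement. Extend the tag list if your input uses other block-level elements such as li or tr.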

Upvotes: 2
