Converting HTML to plain text while preserving line breaks

Question

I'm using Beautiful Soup in Python to turn some fairly junky HTML into plain text while preserving some of the HTML formatting, specifically the line breaks.

Here's an example:

from bs4 import BeautifulSoup

html_input = '''

Full
Name:
John Doe
Phone: 01234123123

Note: This
is a 
test message

It should be ignored.

'''

message_body_plain = BeautifulSoup(html_input.replace('\n', '').replace('\r', ''), 'html.parser')
print(message_body_plain.get_text())

Sometimes the HTML I've got has newlines where spaces belong (see "Full Name" above), and sometimes it doesn't. I've tried stripping out all the newlines and replacing the HTML line-break tags with newline literals, but that breaks whenever I come across a line break written in a way I hadn't considered. Surely there's a parser that does this for me?
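To illustrate why literal replacement is brittle, here is a small sketch. The `variants` string and the `BreakCounter` class are made up for this illustration, and it uses the standard library's `html.parser` rather than Beautiful Soup only to stay dependency-free. String replacement catches just the spellings it lists, while a parser reports every spelling as the same `br` tag.

```python
from html.parser import HTMLParser

# Hypothetical sample: five spellings of the same line-break tag.
variants = "a<br>b<br/>c<br />d<BR>e<br >f"

# Literal replacement only catches the spellings it lists.
naive = variants.replace("<br>", "\n").replace("<br/>", "\n")
missed = naive.count("<")  # break tags still left in the text


class BreakCounter(HTMLParser):
    """Count <br> elements however they are spelled."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag names and routes self-closing
        # forms such as <br/> through this method as well.
        if tag == "br":
            self.count += 1


counter = BreakCounter()
counter.feed(variants)
counter.close()

print(missed, counter.count)
```

Here the two listed spellings are replaced and the other three survive as raw tags, whereas the parser counts all five.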

Here's my preferred output:

Full Name: John Doe
Phone: 01234123123
Note: This is a test message
It should be ignored.

Note how the only newlines are from the HTML tags. Does anyone know the best way to achieve what I want?
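For reference, the behavior described above could be sketched roughly as follows. This is only one possible approach, not a definitive implementation: it uses the standard library's `html.parser` instead of Beautiful Soup, the `BREAK_TAGS` set, the `LineBreakText` class, and the `html_to_text` helper are names invented for this sketch, and the `sample` markup is an assumed reconstruction (a `<p>` element plus a few `<br>` variants), since the exact tags may differ in real input.

```python
import re
from html.parser import HTMLParser

# Tags treated as line breaks; this set is an assumption, extend as needed.
BREAK_TAGS = {"br", "p", "div", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}


class LineBreakText(HTMLParser):
    """Collect text, turning only break-like tags into newlines."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in BREAK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        # <br> was already handled as a start tag; block tags also
        # end the current line when they close.
        if tag in BREAK_TAGS and tag != "br":
            self.parts.append("\n")

    def handle_data(self, data):
        # Newlines and whitespace runs in the source become one space,
        # so they can never be mistaken for tag-driven line breaks.
        self.parts.append(re.sub(r"\s+", " ", data))


def html_to_text(html):
    parser = LineBreakText()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    lines = (" ".join(line.split()) for line in "".join(parser.parts).split("\n"))
    # Dropping empty lines also collapses consecutive breaks into one.
    return "\n".join(line for line in lines if line)


# Assumed markup for the example; the real tags may differ.
sample = """
<p>Full
Name:
John Doe</p>
Phone: 01234123123<br />
Note: This
is a 
test message<br><br >
It should be ignored.
"""

print(html_to_text(sample))
```

With the assumed sample, this should print the four lines of the preferred output. Because empty lines are dropped, repeated breaks such as `<br><br>` yield a single newline; keep the empty lines instead if blank lines should survive.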
