Reputation: 5
I am trying to convert HTML to text using the BeautifulSoup library. However, get_text() does not insert a space (or a newline) between the contents of adjacent paragraph tags:
from bs4 import BeautifulSoup
test_input = '<html><p>this is sentence 1</p><p>this is sentence 2</p></html>'
soup = BeautifulSoup(test_input, 'html.parser')
print(soup.get_text())
Output: this is sentence 1this is sentence 2
Expectation: this is sentence 1 this is sentence 2
Can BeautifulSoup handle this somehow, or is there an alternative library I could use?
Upvotes: 0
Views: 793
Reputation: 493
You can collect the <p> tags yourself and join their text with whatever separator you want:
from bs4 import BeautifulSoup
test_input = '<html><p>this is sentence 1</p><p>this is sentence 2</p></html>'
soup = BeautifulSoup(test_input, 'html.parser')
data = soup.find_all('p')
output = " ".join([p1.text for p1 in data])
The output will be:
this is sentence 1 this is sentence 2
If you want each paragraph on its own line, just change the join line to:
output = "\n".join([p1.text for p1 in data])
and the output will be:
this is sentence 1
this is sentence 2
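As a possible alternative to joining the <p> tags by hand, get_text() itself accepts a separator argument and a strip flag, so the same result may be reachable in one call. The sketch below assumes that behaviour is acceptable for your input; html_to_text is just a hypothetical helper name for illustration, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup


def html_to_text(html, separator=" "):
    """Hypothetical helper: extract text, joining text nodes with `separator`."""
    soup = BeautifulSoup(html, "html.parser")
    # separator is placed between every text node; strip=True trims
    # whitespace around each node and drops nodes that are empty.
    return soup.get_text(separator=separator, strip=True)


test_input = '<html><p>this is sentence 1</p><p>this is sentence 2</p></html>'

print(html_to_text(test_input))        # this is sentence 1 this is sentence 2
print(html_to_text(test_input, "\n"))  # one sentence per line
```

Note that the separator goes between every text node, not only between paragraphs, so text split by inline tags such as <b> or <a> will get the separator as well. If that matters for your documents, joining the <p> tags explicitly as shown above gives you more control.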
Upvotes: 1