Nishant Saraswat

Reputation: 5

How to get BeautifulSoup's get_text() to add spacing between paragraph tags

I am trying to convert HTML to text using the BeautifulSoup library. However, get_text() does not insert a space (or a newline) between paragraph tags:

from bs4 import BeautifulSoup
test_input = '<html><p>this is sentence 1</p><p>this is sentence 2</p></html>'
soup = BeautifulSoup(test_input, 'html.parser')
print(soup.get_text())

Output: this is sentence 1this is sentence 2

Expectation: this is sentence 1 this is sentence 2

Can BeautifulSoup handle this somehow, or is there an alternative library I could use?

Upvotes: 0

Views: 793

Answers (1)

Akash senta

Reputation: 493

You can collect the <p> tags and join their text yourself:

from bs4 import BeautifulSoup
test_input = '<html><p>this is sentence 1</p><p>this is sentence 2</p></html>'
soup = BeautifulSoup(test_input, 'html.parser')
# Collect every <p> tag, then join their text with a single space
data = soup.find_all('p')
output = " ".join([p1.text for p1 in data])
print(output)

The output will be:

this is sentence 1 this is sentence 2

If you want each paragraph on its own line, change the join line to:

output = "\n".join([p1.text for p1 in data])

and the output will be:

this is sentence 1
this is sentence 2
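
As an alternative to joining the <p> tags by hand, get_text() itself accepts a separator argument, which is placed between the text pieces it collects. Here is a minimal sketch using the same test_input; the helper name html_to_text is made up for illustration and is not part of BeautifulSoup:

```python
from bs4 import BeautifulSoup

test_input = '<html><p>this is sentence 1</p><p>this is sentence 2</p></html>'


def html_to_text(html, sep=" "):
    """Hypothetical helper: extract text with `sep` between text pieces."""
    soup = BeautifulSoup(html, 'html.parser')
    # separator is inserted between every text node, not only between <p> tags;
    # strip=True trims whitespace around each piece and skips empty ones
    return soup.get_text(separator=sep, strip=True)


spaced = html_to_text(test_input)            # pieces joined by a space
one_per_line = html_to_text(test_input, "\n")  # pieces joined by a newline
print(spaced)
print(one_per_line)
```

Note that the separator applies to every text node, so markup nested inside a paragraph (for example a <b> tag) is also split by it. If you only want breaks between paragraphs, joining the result of find_all('p') as shown above is probably the safer choice.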

Upvotes: 1
