Reputation: 3033
I'm working on a translator that can translate text inside html tags and I'm using beautifulsoup because it's one of the best html parsers in python.
Here's the text and loading it into soup
In [95]: chalet.html
Out[95]: '<h4><strong>“Create a space I would be truly excited to stay in”.</strong></h4>\r\n\r\n<h4><strong>That was the brief given to renowned architect, Herve Marullaz, after Chalet Joux Plane’s owner secured a large plot of mountain land that backed onto a stream and an alpine woodland. The result was Chalet</strong> <strong>Belle Chéry.</strong></h4>\r\n\r\n<p>Belle Chéry is a chalet built without constraint. A destination, to be experienced. The building itself nestles into the mountain and you enter down a 15m underground tunnel that takes you from the garage and boot room into the very heart of the chalet.</p>\r\n\r\n<p>The chalet itself is spread over 680m<sup>2</sup>, with one side of the chalet almost entirely glazed, offering mountain views from all the living spaces and entertainment areas. The chalet can sleep up to 14 guests across 5 luxurious bedrooms and a family/children’s bunk room, all opening out onto secluded terraces and enjoying free standing baths and hanging seats.</p>\r\n\r\n<p>The specification list for the chalet is, of course, almost endless and includes a 23-meter indoor-outdoor swimming pool, a Bamford Spa including treatment room and sauna, a private gym, a cinema room, art gallery and children’s playroom. The living space is vast and includes a luxurious lounge area with open fireplace and delicious sofas, a library, a floating mezzanine dining area and a bar mezzanine with balcony overlooking the mountains.</p>'
In [96]: html = soup(chalet.html)
In [97]: print(chalet.html)
<h4><strong>“Create a space I would be truly excited to stay in”.</strong></h4>
<h4><strong>That was the brief given to renowned architect, Herve Marullaz, after Chalet Joux Plane’s owner secured a large plot of mountain land that backed onto a stream and an alpine woodland. The result was Chalet</strong> <strong>Belle Chéry.</strong></h4>
<p>Belle Chéry is a chalet built without constraint. A destination, to be experienced. The building itself nestles into the mountain and you enter down a 15m underground tunnel that takes you from the garage and boot room into the very heart of the chalet.</p>
<p>The chalet itself is spread over 680m<sup>2</sup>, with one side of the chalet almost entirely glazed, offering mountain views from all the living spaces and entertainment areas. The chalet can sleep up to 14 guests across 5 luxurious bedrooms and a family/children’s bunk room, all opening out onto secluded terraces and enjoying free standing baths and hanging seats.</p>
<p>The specification list for the chalet is, of course, almost endless and includes a 23-meter indoor-outdoor swimming pool, a Bamford Spa including treatment room and sauna, a private gym, a cinema room, art gallery and children’s playroom. The living space is vast and includes a luxurious lounge area with open fireplace and delicious sofas, a library, a floating mezzanine dining area and a bar mezzanine with balcony overlooking the mountains.</p>
Next is breaking it down into contents so I can parse them
In [105]: html.contents
Out[105]:
[<h4><strong>“Create a space I would be truly excited to stay in”.</strong></h4>,
'\n',
<h4><strong>That was the brief given to renowned architect, Herve Marullaz, after Chalet Joux Plane’s owner secured a large plot of mountain land that backed onto a stream and an alpine woodland. The result was Chalet</strong> <strong>Belle Chéry.</strong></h4>,
'\n',
<p>Belle Chéry is a chalet built without constraint. A destination, to be experienced. The building itself nestles into the mountain and you enter down a 15m underground tunnel that takes you from the garage and boot room into the very heart of the chalet.</p>,
'\n',
<p>The chalet itself is spread over 680m<sup>2</sup>, with one side of the chalet almost entirely glazed, offering mountain views from all the living spaces and entertainment areas. The chalet can sleep up to 14 guests across 5 luxurious bedrooms and a family/children’s bunk room, all opening out onto secluded terraces and enjoying free standing baths and hanging seats.</p>,
'\n',
<p>The specification list for the chalet is, of course, almost endless and includes a 23-meter indoor-outdoor swimming pool, a Bamford Spa including treatment room and sauna, a private gym, a cinema room, art gallery and children’s playroom. The living space is vast and includes a luxurious lounge area with open fireplace and delicious sofas, a library, a floating mezzanine dining area and a bar mezzanine with balcony overlooking the mountains.</p>]
the thing is in between all these is new lines which I can ignore with a try and catch block but getting the string also seems to only work on some not all of them
In [107]: contents[0]
Out[107]: <h4><strong>“Create a space I would be truly excited to stay in”.</strong></h4>
In [108]: contents[0].string
Out[108]: '“Create a space I would be truly excited to stay in”.'
In [109]: contents[1]
Out[109]: '\n'
In [110]: contents[1].string
Out[110]: '\n'
In [111]: contents[2]
Out[111]: <h4><strong>That was the brief given to renowned architect, Herve Marullaz, after Chalet Joux Plane’s owner secured a large plot of mountain land that backed onto a stream and an alpine woodland. The result was Chalet</strong> <strong>Belle Chéry.</strong></h4>
In [112]: contents[2].string
If you know how to extract these sections in a way that it doesn't strip tags in between so replace
would work on the main string.
Upvotes: 0
Views: 341
Reputation: 41987
You can use a list comp and str.join
to join the list of contents without the newlines to get the desired output:
contents = ''.join([data for data in html.contents if data != '\n'])
Now, you can create the soup:
soup = BeautifulSoup(contents, 'lxml')
replace lxml
with your preferred parser.
Upvotes: 0
Reputation: 11071
Use .stripped_strings
property to get clean, stripped texts out of HTML.
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#strings-and-stripped-strings
from bs4 import BeautifulSoup
from pprint import pprint
html = '''
<h4><strong>“Create a space I would be truly excited to stay in”.</strong></h4>
<h4><strong>That was the brief given to renowned architect, Herve Marullaz, after Chalet Joux Plane’s owner secured a large plot of mountain land that backed onto a stream and an alpine woodland. The result was Chalet</strong> <strong>Belle Chéry.</strong></h4>
<p>Belle Chéry is a chalet built without constraint. A destination, to be experienced. The building itself nestles into the mountain and you enter down a 15m underground tunnel that takes you from the garage and boot room into the very heart of the chalet.</p>
<p>The chalet itself is spread over 680m<sup>2</sup>, with one side of the chalet almost entirely glazed, offering mountain views from all the living spaces and entertainment areas. The chalet can sleep up to 14 guests across 5 luxurious bedrooms and a family/children’s bunk room, all opening out onto secluded terraces and enjoying free standing baths and hanging seats.</p>
<p>The specification list for the chalet is, of course, almost endless and includes a 23-meter indoor-outdoor swimming pool, a Bamford Spa including treatment room and sauna, a private gym, a cinema room, art gallery and children’s playroom. The living space is vast and includes a luxurious lounge area with open fireplace and delicious sofas, a library, a floating mezzanine dining area and a bar mezzanine with balcony overlooking the mountains.</p>
'''
soup = BeautifulSoup(html, 'html.parser')
texts = [*soup.stripped_strings]
pprint(texts)
output:
['“Create a space I would be truly excited to stay in”.',
'That was the brief given to renowned architect, Herve Marullaz, after Chalet '
'Joux Plane’s owner secured a large plot of mountain land that backed onto a '
'stream and an alpine woodland. The result was Chalet',
'Belle Chéry.',
'Belle Chéry is a chalet built without constraint. A destination, to be '
...
to get a single long string:
long_string = ' '.join(texts)
output:
“Create a space I would be truly excited to stay in”. That was the brief given to renowned architect, Herve Marullaz, after Chalet Joux Plane’s owner secured a large plot of mountain land that backed onto a stream and an alpine woodland. The result was Chalet Belle C ...
Upvotes: 1