Sam B.

Reputation: 3033

BeautifulSoup: get inner content inside HTML tags

I'm working on a translator that translates the text inside HTML tags. I'm using BeautifulSoup because it's one of the best HTML parsers in Python.

Here is the text, and how I load it into soup:

In [95]: chalet.html                                                                                                                                                                       
Out[95]: '<h4><strong>&ldquo;Create a space I would be truly excited to stay in&rdquo;.</strong></h4>\r\n\r\n<h4><strong>That was the brief given to renowned architect, Herve Marullaz, after Chalet Joux Plane&rsquo;s owner secured a large plot of mountain land that backed onto a stream and an alpine woodland. The result was Chalet</strong> <strong>Belle Ch&eacute;ry.</strong></h4>\r\n\r\n<p>Belle Ch&eacute;ry is a chalet built without constraint. A destination, to be experienced. The building itself nestles into the mountain and you enter down a 15m underground tunnel that takes you from the garage and boot room into the very heart of the chalet.</p>\r\n\r\n<p>The chalet itself is spread over 680m<sup>2</sup>, with one side of the chalet almost entirely glazed, offering mountain views from all the living spaces and entertainment areas. The chalet can sleep up to 14 guests across 5 luxurious bedrooms and a family/children&rsquo;s bunk room, all opening out onto secluded terraces and enjoying free standing baths and hanging seats.</p>\r\n\r\n<p>The specification list for the chalet is, of course, almost endless and includes a 23-meter indoor-outdoor swimming pool, a Bamford Spa including treatment room and sauna, a private gym, a cinema room, art gallery and children&rsquo;s playroom. The living space is vast and includes a luxurious lounge area with open fireplace and delicious sofas, a library, a floating mezzanine dining area and a bar mezzanine with balcony overlooking the mountains.</p>'

In [96]: html = soup(chalet.html)                                                                                                                                                          

In [97]: print(chalet.html)                                                                                                                                                                
<h4><strong>&ldquo;Create a space I would be truly excited to stay in&rdquo;.</strong></h4>

<h4><strong>That was the brief given to renowned architect, Herve Marullaz, after Chalet Joux Plane&rsquo;s owner secured a large plot of mountain land that backed onto a stream and an alpine woodland. The result was Chalet</strong> <strong>Belle Ch&eacute;ry.</strong></h4>

<p>Belle Ch&eacute;ry is a chalet built without constraint. A destination, to be experienced. The building itself nestles into the mountain and you enter down a 15m underground tunnel that takes you from the garage and boot room into the very heart of the chalet.</p>

<p>The chalet itself is spread over 680m<sup>2</sup>, with one side of the chalet almost entirely glazed, offering mountain views from all the living spaces and entertainment areas. The chalet can sleep up to 14 guests across 5 luxurious bedrooms and a family/children&rsquo;s bunk room, all opening out onto secluded terraces and enjoying free standing baths and hanging seats.</p>

<p>The specification list for the chalet is, of course, almost endless and includes a 23-meter indoor-outdoor swimming pool, a Bamford Spa including treatment room and sauna, a private gym, a cinema room, art gallery and children&rsquo;s playroom. The living space is vast and includes a luxurious lounge area with open fireplace and delicious sofas, a library, a floating mezzanine dining area and a bar mezzanine with balcony overlooking the mountains.</p>

Next, I break it down into its contents so I can parse them:

In [105]: html.contents                                                                                                                                                                    
Out[105]: 
[<h4><strong>“Create a space I would be truly excited to stay in”.</strong></h4>,
'\n',
<h4><strong>That was the brief given to renowned architect, Herve Marullaz, after Chalet Joux Plane’s owner secured a large plot of mountain land that backed onto a stream and an alpine woodland. The result was Chalet</strong> <strong>Belle Chéry.</strong></h4>,
'\n',
<p>Belle Chéry is a chalet built without constraint. A destination, to be experienced. The building itself nestles into the mountain and you enter down a 15m underground tunnel that takes you from the garage and boot room into the very heart of the chalet.</p>,
'\n',
<p>The chalet itself is spread over 680m<sup>2</sup>, with one side of the chalet almost entirely glazed, offering mountain views from all the living spaces and entertainment areas. The chalet can sleep up to 14 guests across 5 luxurious bedrooms and a family/children’s bunk room, all opening out onto secluded terraces and enjoying free standing baths and hanging seats.</p>,
'\n',
<p>The specification list for the chalet is, of course, almost endless and includes a 23-meter indoor-outdoor swimming pool, a Bamford Spa including treatment room and sauna, a private gym, a cinema room, art gallery and children’s playroom. The living space is vast and includes a luxurious lounge area with open fireplace and delicious sofas, a library, a floating mezzanine dining area and a bar mezzanine with balcony overlooking the mountains.</p>]

The problem is that there are newlines in between all of these. I can skip those with a try/except block, but getting the string with .string also only seems to work on some of the elements, not all of them:

In [107]: contents[0]                                                                                                                                                                      
Out[107]: <h4><strong>“Create a space I would be truly excited to stay in”.</strong></h4>

In [108]: contents[0].string                                                                                                                                                               
Out[108]: '“Create a space I would be truly excited to stay in”.'

In [109]: contents[1]                                                                                                                                                                      
Out[109]: '\n'

In [110]: contents[1].string                                                                                                                                                               
Out[110]: '\n'

In [111]: contents[2]                                                                                                                                                                      
Out[111]: <h4><strong>That was the brief given to renowned architect, Herve Marullaz, after Chalet Joux Plane’s owner secured a large plot of mountain land that backed onto a stream and an alpine woodland. The result was Chalet</strong> <strong>Belle Chéry.</strong></h4>

In [112]: contents[2].string    

How can I extract these sections without stripping the tags in between, so that a replace would still work on the main string?

Upvotes: 0

Views: 341

Answers (2)

heemayl

Reputation: 41987

You can use a list comprehension and str.join to join the list of contents without the newlines. The items in html.contents are Tag objects, so each one has to be converted with str() before joining:

contents = ''.join([str(data) for data in html.contents if data != '\n'])

Now, you can create the soup:

soup = BeautifulSoup(contents, 'lxml')

Replace lxml with your preferred parser.
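As a rough sketch of a related approach, the top-level nodes can be filtered by type instead of being compared to '\n', which also skips whitespace-only nodes such as '\r\n'. Note that .string returns None for a tag with more than one child, which is the case for contents[2] in the question, while decode_contents() returns the inner HTML with nested tags intact. The raw string here is a shortened stand-in for chalet.html, and the names blocks and inner_html are only illustrative.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened stand-in for chalet.html (assumption: not the full original text).
raw = (
    "<h4><strong>Create a space</strong></h4>\r\n\r\n"
    "<h4><strong>The result was Chalet</strong> "
    "<strong>Belle Ch&eacute;ry.</strong></h4>\r\n\r\n"
    "<p>The chalet is spread over 680m<sup>2</sup>.</p>"
)

soup = BeautifulSoup(raw, "html.parser")

# Keep only real tags; the whitespace-only text nodes between them are skipped.
blocks = [node for node in soup.contents if isinstance(node, Tag)]

# decode_contents() gives the inner HTML of each tag, nested tags included.
inner_html = [block.decode_contents() for block in blocks]

for fragment in inner_html:
    print(fragment)
```

Each fragment keeps its inner tags such as strong and sup, so a plain string replace on it does not lose markup.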

Upvotes: 0

abdusco

Reputation: 11071

Use the .stripped_strings property to get clean, whitespace-stripped text out of the HTML.

https://www.crummy.com/software/BeautifulSoup/bs4/doc/#strings-and-stripped-strings

from bs4 import BeautifulSoup
from pprint import pprint

html = '''
<h4><strong>&ldquo;Create a space I would be truly excited to stay in&rdquo;.</strong></h4>
<h4><strong>That was the brief given to renowned architect, Herve Marullaz, after Chalet Joux Plane&rsquo;s owner secured a large plot of mountain land that backed onto a stream and an alpine woodland. The result was Chalet</strong> <strong>Belle Ch&eacute;ry.</strong></h4>
<p>Belle Ch&eacute;ry is a chalet built without constraint. A destination, to be experienced. The building itself nestles into the mountain and you enter down a 15m underground tunnel that takes you from the garage and boot room into the very heart of the chalet.</p>
<p>The chalet itself is spread over 680m<sup>2</sup>, with one side of the chalet almost entirely glazed, offering mountain views from all the living spaces and entertainment areas. The chalet can sleep up to 14 guests across 5 luxurious bedrooms and a family/children&rsquo;s bunk room, all opening out onto secluded terraces and enjoying free standing baths and hanging seats.</p>
<p>The specification list for the chalet is, of course, almost endless and includes a 23-meter indoor-outdoor swimming pool, a Bamford Spa including treatment room and sauna, a private gym, a cinema room, art gallery and children&rsquo;s playroom. The living space is vast and includes a luxurious lounge area with open fireplace and delicious sofas, a library, a floating mezzanine dining area and a bar mezzanine with balcony overlooking the mountains.</p>
'''
soup = BeautifulSoup(html, 'html.parser')
texts = [*soup.stripped_strings]
pprint(texts)

Output:

['“Create a space I would be truly excited to stay in”.',
 'That was the brief given to renowned architect, Herve Marullaz, after Chalet '
 'Joux Plane’s owner secured a large plot of mountain land that backed onto a '
 'stream and an alpine woodland. The result was Chalet',
 'Belle Chéry.',
 'Belle Chéry is a chalet built without constraint. A destination, to be '
...

To get a single long string:

long_string = ' '.join(texts)

Output:

“Create a space I would be truly excited to stay in”. That was the brief given to renowned architect, Herve Marullaz, after Chalet Joux Plane’s owner secured a large plot of mountain land that backed onto a stream and an alpine woodland. The result was Chalet Belle C ...
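Since .stripped_strings drops the tags, the translated text cannot be put back into the markup by position alone. One possible way to keep the tags, sketched here under assumptions, is to walk the text nodes and swap each one in place with replace_with(). The translate function is a hypothetical placeholder (it only upper-cases its input) standing in for whatever translation call is actually used, and raw is a shortened sample rather than the full text.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString


def translate(text: str) -> str:
    """Hypothetical stand-in for a real translation call; it only upper-cases."""
    return text.upper()


# Shortened sample taken from the second heading in the question.
raw = (
    "<h4><strong>The result was Chalet</strong> "
    "<strong>Belle Ch&eacute;ry.</strong></h4>"
)
soup = BeautifulSoup(raw, "html.parser")

# Snapshot the text nodes first, because replace_with() mutates the tree.
for node in list(soup.find_all(string=True)):
    if node.strip():  # leave whitespace-only nodes untouched
        node.replace_with(NavigableString(translate(str(node))))

result = str(soup)
print(result)
```

The surrounding h4 and strong tags stay where they were; only the text inside them changes.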

Upvotes: 1
