Reputation: 87

Python Beautiful Soup unwrap() not working as expected - want to extract content of a tag

I'm new to Beautiful Soup and am having trouble understanding why unwrap() behaves the way it does in my case.

I have Python 3.6.9 and beautifulsoup4 4.8.2.

My input HTML is:

 html='''
    <html>
    <head>
        <meta charset="utf-8"/>
        <link rel="stylesheet" type="text/css" href="../../common/style.css"/>
    </head>
    <body>
    <div id="content">
       <h3  HEAD /h3>
          <div class="myclass">
          <br>
          MY TEXT
          <br>
         </div>
        <h3  HEAD2 /h3>
          <div class="myclass">
          <br>
          MY TEXT 2
          <br>
         </div>
    </div>
    </body>
    </html>
    '''

I want to get the content of the div with the id "content". I figured unwrap() would do this:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
content = soup.find('div', {"id": "content"}).unwrap()

But this gives me the tag without its contents:

print(content)

<div id="content"></div>

What happens here? How do I correctly extract the contents of the tag, without keeping the surrounding tag?

The output I expect is:

   <h3  HEAD /h3>
      <div class="myclass">
      <br>
      MY TEXT
      <br>
     </div>
    <h3  HEAD2 /h3>
      <div class="myclass">
      <br>
      MY TEXT 2
      <br>
     </div>

When I use the .children approach instead, the tags get escaped when I append the result to a BeautifulSoup object:

final_content=''.join([str(i) for i in content.children]) 
body.append(final_content)

This results in:

&lt;h3 head=""&gt;
&lt;div class="myclass"&gt;
&lt;br/&gt;
      MY TEXT
      &lt;br/&gt;
&lt;/div&gt;
&lt;h3 head2=""&gt;
&lt;div class="myclass"&gt;
&lt;br/&gt;
      MY TEXT 2
      &lt;br/&gt;
&lt;/div&gt;
&lt;/h3&gt;&lt;/h3&gt;</div>

Upvotes: 3

Answers (3)

Vlad Turcanu

Reputation: 56

TL;DR: Print soup, not content

I had the same problem and couldn't figure out why unwrap() didn't return what I wanted. It works a little differently than you might expect.

unwrap() modifies the original soup in place: it replaces the tag with its contents and returns the tag it removed, which is now empty. Any variable that holds the return value (like content in the question) therefore contains only the bare tag; the unwrapped content stays in soup.
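A minimal sketch of that behaviour, assuming a shortened stand-in for the HTML in the question and the built-in html.parser (both chosen only to keep the example small):

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the question's HTML (illustrative only)
html = '<body><div id="content"><h3>HEAD</h3><p>MY TEXT</p></div></body>'
soup = BeautifulSoup(html, "html.parser")

# unwrap() edits the tree in place and returns the tag it removed
removed = soup.find("div", id="content").unwrap()

print(removed)  # <div id="content"></div>
print(soup)     # <body><h3>HEAD</h3><p>MY TEXT</p></body>
```

The children are still in soup, now sitting directly under body, which is why printing soup shows them and printing the return value does not.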

Upvotes: 4

Juan C

Reputation: 6132

First, fix your HTML so it parses correctly (the h3 tags were malformed):

html='''
   <html>
   <head>
       <meta charset="utf-8"/>
       <link rel="stylesheet" type="text/css" href="../../common/style.css"/>
   </head>
   <body>
   <div id="content">
      <h3>  HEAD </h3>
         <div class="myclass">
         <br>
         MY TEXT
         <br>
        </div>
       <h3>  HEAD2 </h3>
         <div class="myclass">
         <br>
         MY TEXT 2
         <br>
        </div>
   </div>
   </body>
   </html>
   '''

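unwrap() removes the tag from your soup and moves its contents into the parent tag (if you check your soup again after running your code, there is no element with the id "content" anymore). So to get the contents, find the tag without unwrapping it and read its children:

soup = BeautifulSoup(html, 'lxml')
content = soup.find('div', {"id": "content"})
content.contents[1:]  # slice off the leading '\n' text node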

Output:

[<h3>  HEAD </h3>, '\n', <div class="myclass">
 <br/>
          MY TEXT
          <br/>
 </div>, '\n', <h3>  HEAD2 </h3>, '\n', <div class="myclass">
 <br/>
          MY TEXT 2
          <br/>
 </div>, '\n']

Alternative using `children` based on @KunduK's answer:

final_content = ''.join([str(i) for i in content.children])

Output 2:

'\n<h3>  HEAD </h3>\n<div class="myclass">\n<br/>\n         MY TEXT\n         <br/>\n</div>\n<h3>  HEAD2 </h3>\n<div class="myclass">\n<br/>\n         MY TEXT 2\n         <br/>\n</div>\n'
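If a single string is all you need, Tag.decode_contents() should give the same result as the join in one call. A minimal sketch, assuming a shortened stand-in for the HTML and the built-in html.parser:

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the HTML in the question (illustrative only)
html = '<div id="content"><h3>HEAD</h3><div class="myclass">MY TEXT</div></div>'
soup = BeautifulSoup(html, "html.parser")
content = soup.find("div", id="content")

# decode_contents() renders the tag's children as a string, without the tag itself
inner = content.decode_contents()
joined = "".join(str(child) for child in content.children)

print(inner)            # <h3>HEAD</h3><div class="myclass">MY TEXT</div>
print(inner == joined)  # True
```

Neither call modifies the soup, so content can still be used afterwards.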

Upvotes: 2

KunduK

Reputation: 33384

Use element.children and then iterate.

html='''
    <html>
    <head>
        <meta charset="utf-8"/>
        <link rel="stylesheet" type="text/css" href="../../common/style.css"/>
    </head>
    <body>
    <div id="content">
       <h3>  HEAD </h3>
          <div class="myclass">
          <br>
          MY TEXT
          <br>
         </div>
        <h3>  HEAD2 </h3>
          <div class="myclass">
          <br>
          MY TEXT 2
          <br>
         </div>
    </div>
    </body>
    </html>
    '''

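from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
# .children yields every direct child, including the '\n' text nodes between tags
for item in soup.find('div', id='content').children:
    print(item)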

Output:

<h3>  HEAD </h3>


<div class="myclass">
<br/>
          MY TEXT
          <br/>
</div>


<h3>  HEAD2 </h3>


<div class="myclass">
<br/>
          MY TEXT 2
          <br/>
</div>
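The blank lines in the output above come from the '\n' text nodes between the tags. If you only want the element children, one option is find_all() with recursive=False. A minimal sketch, assuming a shortened stand-in for the HTML:

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the HTML above (illustrative only)
html = '''
<div id="content">
   <h3>  HEAD </h3>
   <div class="myclass">MY TEXT</div>
</div>
'''
soup = BeautifulSoup(html, "html.parser")
content = soup.find("div", id="content")

# True matches any tag; recursive=False restricts the search to direct children,
# so whitespace-only text nodes are skipped
tags_only = content.find_all(True, recursive=False)
for tag in tags_only:
    print(tag.name)
```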

If you want the whole content in a single variable, try this:

html='''
    <html>
    <head>
        <meta charset="utf-8"/>
        <link rel="stylesheet" type="text/css" href="../../common/style.css"/>
    </head>
    <body>
    <div id="content">
       <h3>  HEAD </h3>
          <div class="myclass">
          <br>
          MY TEXT
          <br>
         </div>
        <h3>  HEAD2 </h3>
          <div class="myclass">
          <br>
          MY TEXT 2
          <br>
         </div>
    </div>
    </body>
    </html>
    '''

soup = BeautifulSoup(html, 'html.parser')
# Concatenate the string form of each direct child into one variable
str1 = ''
for item in soup.find('div', id='content').children:
    str1 = str1 + str(item)

print(str1)
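On the escaping problem from the question: appending a str to a tag adds it as a text node, so the markup is escaped on output. One way around that, sketched here with a shortened HTML snippet and a hypothetical target document, is to append the child nodes themselves instead of their string form:

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the question's HTML (illustrative only)
source = BeautifulSoup('<div id="content"><h3>HEAD</h3><p>MY TEXT</p></div>', "html.parser")
content = source.find("div", id="content")

# Appending a plain string inserts a text node, so the markup is escaped on output
escaped = BeautifulSoup("<body></body>", "html.parser")
escaped.body.append(str(content.h3))
print(escaped)  # <body>&lt;h3&gt;HEAD&lt;/h3&gt;</body>

# Appending the nodes themselves keeps them as tags.
# list() takes a snapshot first, because append() moves each node out of `content`.
target = BeautifulSoup("<body></body>", "html.parser")  # hypothetical destination document
for child in list(content.children):
    target.body.append(child)
print(target)   # <body><h3>HEAD</h3><p>MY TEXT</p></body>
```

Note that this moves the nodes, so content is empty afterwards. If the source document has to stay intact, parsing the joined string with BeautifulSoup and appending the parsed nodes is an alternative.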