Matteo Monti

Reputation: 8940

BeautifulSoup innerhtml?

Let's say I have a page with a div. I can easily get that div with soup.find().

Now that I have the result, I'd like to print the WHOLE inner HTML of that div: I need a string with ALL the HTML tags and text together, exactly like the string I'd get in JavaScript with obj.innerHTML. Is this possible?

Upvotes: 79

Views: 76246

Answers (8)

BSimjoo

Reputation: 174

If I understand correctly, you mean that for an example like this:

<div class="test">
    text in body
    <p>Hello World!</p>
</div>

the output should look like this:

text in body
    <p>Hello World!</p>

So here is your answer:

''.join(map(str,tag.contents))
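
As a rough, self-contained sketch of the one-liner above applied to the example div (the variable names html, soup and tag are just illustrative, and html.parser is an assumed parser choice):

```python
from bs4 import BeautifulSoup

# The example markup from this answer.
html = '<div class="test">\n    text in body\n    <p>Hello World!</p>\n</div>'
soup = BeautifulSoup(html, "html.parser")
tag = soup.find("div", class_="test")

# .contents is a list of the tag's direct children (text nodes and tags).
# str() serializes each child, so joining them yields everything inside
# the div without the <div> wrapper itself.
inner_html = "".join(map(str, tag.contents))
print(inner_html)
```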

Upvotes: 1

Pikamander2

Reputation: 8299

Given a BS4 soup element like <div id="outer"><div id="inner">foobar</div></div>, here are several methods and attributes that retrieve its HTML and text in different ways, along with an example of what each returns.


InnerHTML:

inner_html = element.encode_contents()

b'<div id="inner">foobar</div>'

OuterHTML:

outer_html = str(element)

'<div id="outer"><div id="inner">foobar</div></div>'

OuterHTML (prettified):

pretty_outer_html = element.prettify()

'''<div id="outer">
 <div id="inner">
  foobar
 </div>
</div>'''

Text only (using .text):

element_text = element.text

'foobar'

Text only (using .string):

element_string = element.string

'foobar'
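
One caveat worth sketching: .text and .string only agree when the element has a single chain of children, as in the example above. The snippet below assumes a slightly different, made-up element with two children to show where they diverge (html.parser is also an assumption):

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the outer div now has TWO children.
markup = '<div id="outer"><div id="inner">foobar</div><p>baz</p></div>'
soup = BeautifulSoup(markup, "html.parser")
outer = soup.find(id="outer")
inner = soup.find(id="inner")

# .string only works when there is exactly one child to descend into.
inner_string = inner.string   # the single text node inside #inner
outer_string = outer.string   # None, because #outer has two children

# .text concatenates every text node beneath the element.
outer_text = outer.text

print(repr(inner_string), repr(outer_string), repr(outer_text))
```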

Upvotes: 18

Praveen Kumar

Reputation: 959

The easiest way is to use the children property.

inner_html = soup.find('body').children

This returns an iterator over the tag's direct children (not a list), so you can print the full inner HTML with a simple for loop:

for html in inner_html:
    print(html)

Upvotes: 1

Y Y

Reputation: 513

For just text, Beautiful Soup 4 get_text()

If you only want the human-readable text inside a document or tag, you can use the get_text() method. It returns all the text in a document or beneath a tag, as a single Unicode string:

markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup, 'html.parser')

soup.get_text()
'\nI linked to example.com\n'
soup.i.get_text()
'example.com' 

You can specify a string to be used to join the bits of text together:

soup.get_text("|")
'\nI linked to |example.com|\n' 

You can tell Beautiful Soup to strip whitespace from the beginning and end of each bit of text:

soup.get_text("|", strip=True)
'I linked to|example.com' 

But at that point you might want to use the .stripped_strings generator instead, and process the text yourself:

[text for text in soup.stripped_strings]
# ['I linked to', 'example.com'] 

As of Beautiful Soup version 4.9.0, when lxml or html.parser are in use, the contents of <script>, <style>, and <template> tags are not considered to be ‘text’, since those tags are not part of the human-visible content of the page.

Refer here: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#get-text

Upvotes: -4

Amir Saniyan

Reputation: 13729

str(element) gives you the outer HTML; you can then strip the outer tag from that string to get the inner HTML.
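
A minimal sketch of that stripping step, using plain string slicing. The helper name strip_outer_tag is hypothetical, and it assumes the input is the serialization of a single element whose opening tag contains no raw '>' inside attribute values:

```python
def strip_outer_tag(outer_html: str) -> str:
    """Return what lies between the first '>' and the last '<'.

    Assumes outer_html is one element with an opening and a closing tag,
    e.g. the result of str(element).
    """
    start = outer_html.find(">")    # end of the opening tag
    end = outer_html.rfind("<")     # start of the closing tag
    if start == -1 or end == -1 or end <= start:
        # No enclosing tag pair (for example a self-closing tag).
        return ""
    return outer_html[start + 1:end]


outer = '<div id="outer"><div id="inner">foobar</div></div>'
print(strip_outer_tag(outer))
```

This is string surgery rather than parsing, so it is only reasonable on markup that Beautiful Soup has already normalized.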

Upvotes: 3

ChrisD

Reputation: 3518

TL;DR

With BeautifulSoup 4, use element.encode_contents() if you want a UTF-8 encoded bytestring, or element.decode_contents() if you want a Python Unicode string. For example, an equivalent of the DOM's innerHTML property might look something like this:

def innerHTML(element):
    """Returns the inner HTML of an element as a UTF-8 encoded bytestring"""
    return element.encode_contents()

These functions aren't currently in the online documentation so I'll quote the current function definitions and the doc string from the code.

encode_contents - since 4.0.4

def encode_contents(
    self, indent_level=None, encoding=DEFAULT_OUTPUT_ENCODING,
    formatter="minimal"):
    """Renders the contents of this tag as a bytestring.

    :param indent_level: Each line of the rendering will be
       indented this many spaces.

    :param encoding: The bytestring will be in this encoding.

    :param formatter: The output formatter responsible for converting
       entities to Unicode characters.
    """

See also the documentation on formatters; you'll most likely either use formatter="minimal" (the default) or formatter="html" (for html entities) unless you want to manually process the text in some way.

encode_contents returns an encoded bytestring. If you want a Python Unicode string then use decode_contents instead.
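
To make the difference concrete, here is a small sketch comparing the two calls and the two formatters mentioned above. The markup and variable names are made up for illustration, and html.parser is an assumed parser choice:

```python
from bs4 import BeautifulSoup

# Hypothetical element containing a non-ASCII character (e with acute accent).
soup = BeautifulSoup('<div id="outer"><p>caf\u00e9</p></div>', "html.parser")
element = soup.find(id="outer")

# Bytes, UTF-8 encoded by default.
as_bytes = element.encode_contents()

# A regular Python str with the same markup.
as_text = element.decode_contents()

# formatter="html" replaces characters with named HTML entities where possible.
as_entities = element.decode_contents(formatter="html")

print(type(as_bytes).__name__, type(as_text).__name__)
print(as_entities)
```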


decode_contents - since 4.0.1

decode_contents does the same thing as encode_contents but returns a Python Unicode string instead of an encoded bytestring.

def decode_contents(self, indent_level=None,
                   eventual_encoding=DEFAULT_OUTPUT_ENCODING,
                   formatter="minimal"):
    """Renders the contents of this tag as a Unicode string.

    :param indent_level: Each line of the rendering will be
       indented this many spaces.

    :param eventual_encoding: The tag is destined to be
       encoded into this encoding. This method is _not_
       responsible for performing that encoding. This information
       is passed in so that it can be substituted in if the
       document contains a <META> tag that mentions the document's
       encoding.

    :param formatter: The output formatter responsible for converting
       entities to Unicode characters.
    """

BeautifulSoup 3

BeautifulSoup 3 doesn't have the above functions; instead it has renderContents:

def renderContents(self, encoding=DEFAULT_OUTPUT_ENCODING,
                   prettyPrint=False, indentLevel=0):
    """Renders the contents of this tag as a string in the given
    encoding. If encoding is None, returns a Unicode string.."""

This function was added back to BeautifulSoup 4 (in 4.0.4) for compatibility with BS3.

Upvotes: 106

Michael Litvin

Reputation: 4126

How about just unicode(x) (or str(x) in Python 3)? Seems to work for me.

Edit: This will give you the outer HTML and not the inner.

Upvotes: 1

peewhy

Reputation: 430

One option is to use something like this:

innerhtml = "".join([str(x) for x in div_element.contents])

Upvotes: 17
