Reputation: 8940
Let's say I have a page with a div. I can easily get that div with soup.find().
Now that I have the result, I'd like to print the WHOLE innerHTML of that div: I need a string with ALL the HTML tags and text together, exactly like the string I'd get in JavaScript with obj.innerHTML. Is this possible?
Upvotes: 79
Views: 76246
Reputation: 174
If I understand correctly, you mean that for an example like this:
<div class="test">
text in body
<p>Hello World!</p>
</div>
the output should look like this:
text in body
<p>Hello World!</p>
So here is your answer:
''.join(map(str,tag.contents))
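To make the one-liner above concrete, here is a minimal sketch. It assumes bs4 is installed and uses the built-in html.parser; the markup is the example from this answer with the whitespace removed, and the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

# Example markup from the answer, without the surrounding whitespace
html = '<div class="test">text in body<p>Hello World!</p></div>'
soup = BeautifulSoup(html, "html.parser")
tag = soup.find("div", class_="test")

# tag.contents is a list of the direct children (text nodes and tags);
# str() renders each child, and join() glues them back together
inner_html = "".join(map(str, tag.contents))
print(inner_html)
```

One thing to watch, as far as I can tell: text nodes are joined as-is, so special characters such as & inside text are not re-escaped the way they are when the whole tag is rendered.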
Upvotes: 1
Reputation: 8299
Given a BS4 soup element like <div id="outer"><div id="inner">foobar</div></div>, here are some methods and attributes that can be used to retrieve its HTML and text in different ways, along with an example of what each returns.
InnerHTML:
inner_html = element.encode_contents()  # bytes on Python 3; use element.decode_contents() for a str
b'<div id="inner">foobar</div>'
OuterHTML:
outer_html = str(element)
'<div id="outer"><div id="inner">foobar</div></div>'
OuterHTML (prettified):
pretty_outer_html = element.prettify()
'''<div id="outer">
 <div id="inner">
  foobar
 </div>
</div>'''
Text only (using .text):
element_text = element.text
'foobar'
Text only (using .string):
element_string = element.string
'foobar'
Upvotes: 18
Reputation: 959
The easiest way is to use the children property.
inner_html = soup.find('body').children
It returns an iterator rather than a list, so you can get the full markup with a simple for loop.
for html in inner_html:
    print(html)
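If you need a single string instead of printed lines, the children can be joined. A minimal sketch, assuming bs4 with html.parser and a made-up body for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical page, only for illustration
soup = BeautifulSoup("<body><h1>Title</h1><p>Hello</p></body>", "html.parser")
body = soup.find("body")

# .children yields the direct children one at a time; it is not a list,
# so it is consumed here in a single pass
inner_html = "".join(str(child) for child in body.children)
print(inner_html)
```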
Upvotes: 1
Reputation: 513
get_text()
If you only want the human-readable text inside a document or tag, you can use the get_text() method. It returns all the text in a document or beneath a tag, as a single Unicode string:
markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup, 'html.parser')
soup.get_text()
'\nI linked to example.com\n'
soup.i.get_text()
'example.com'
You can specify a string to be used to join the bits of text together:
soup.get_text("|")
'\nI linked to |example.com|\n'
You can tell Beautiful Soup to strip whitespace from the beginning and end of each bit of text:
soup.get_text("|", strip=True)
'I linked to|example.com'
But at that point you might want to use the .stripped_strings generator instead, and process the text yourself:
[text for text in soup.stripped_strings]
# ['I linked to', 'example.com']
As of Beautiful Soup version 4.9.0, when lxml or html.parser are in use, the contents of <script>, <style>, and <template> tags are not considered to be ‘text’, since those tags are not part of the human-visible content of the page.
Refer here: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#get-text
Upvotes: -4
Reputation: 13729
str(element) gives you the outer HTML; you can then remove the outer tag from that string to get the inner HTML.
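A rough sketch of what that could look like, assuming bs4 with html.parser. The slicing is my own illustration and is fragile: it assumes the element has a closing tag (so not a void element like br) and that the first ">" in the string ends the opening tag.

```python
from bs4 import BeautifulSoup

markup = '<div id="outer"><div id="inner">foobar</div></div>'
soup = BeautifulSoup(markup, "html.parser")
element = soup.find("div", id="outer")

outer = str(element)

# Drop everything up to the end of the opening tag...
start = outer.index(">") + 1
# ...and everything from the start of the final closing tag
end = outer.rindex("</")
inner = outer[start:end]
print(inner)
```

In practice element.decode_contents() is less error-prone, since it does not depend on string slicing.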
Upvotes: 3
Reputation: 3518
With BeautifulSoup 4, use element.encode_contents() if you want a UTF-8 encoded bytestring, or element.decode_contents() if you want a Python Unicode string. For example, an equivalent of the DOM's innerHTML property might look something like this:
def innerHTML(element):
    """Returns the inner HTML of an element as a UTF-8 encoded bytestring"""
    return element.encode_contents()
These functions aren't currently in the online documentation so I'll quote the current function definitions and the doc string from the code.
encode_contents - since 4.0.4
def encode_contents(
    self, indent_level=None, encoding=DEFAULT_OUTPUT_ENCODING,
    formatter="minimal"):
    """Renders the contents of this tag as a bytestring.
    :param indent_level: Each line of the rendering will be
        indented this many spaces.
    :param encoding: The bytestring will be in this encoding.
    :param formatter: The output formatter responsible for converting
        entities to Unicode characters.
    """
See also the documentation on formatters; you'll most likely either use formatter="minimal" (the default) or formatter="html" (for HTML entities) unless you want to manually process the text in some way.
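To see the difference between the two formatters, here is a small sketch. It assumes bs4 with html.parser, uses decode_contents() so the result is a str, and the markup is a made-up example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup containing a non-ASCII character and an escaped ampersand
soup = BeautifulSoup("<div><p>caf\u00e9 &amp; bar</p></div>", "html.parser")
div = soup.div

# "minimal" only escapes what is needed for valid markup (such as &)
minimal = div.decode_contents(formatter="minimal")

# "html" additionally converts characters to named HTML entities where possible
as_html = div.decode_contents(formatter="html")

print(minimal)
print(as_html)
```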
encode_contents returns an encoded bytestring. If you want a Python Unicode string then use decode_contents instead.
decode_contents - since 4.0.1
decode_contents does the same thing as encode_contents but returns a Python Unicode string instead of an encoded bytestring.
def decode_contents(self, indent_level=None,
                    eventual_encoding=DEFAULT_OUTPUT_ENCODING,
                    formatter="minimal"):
    """Renders the contents of this tag as a Unicode string.
    :param indent_level: Each line of the rendering will be
        indented this many spaces.
    :param eventual_encoding: The tag is destined to be
        encoded into this encoding. This method is _not_
        responsible for performing that encoding. This information
        is passed in so that it can be substituted in if the
        document contains a <META> tag that mentions the document's
        encoding.
    :param formatter: The output formatter responsible for converting
        entities to Unicode characters.
    """
BeautifulSoup 3 doesn't have the above functions; instead it has renderContents:
def renderContents(self, encoding=DEFAULT_OUTPUT_ENCODING,
                   prettyPrint=False, indentLevel=0):
    """Renders the contents of this tag as a string in the given
    encoding. If encoding is None, returns a Unicode string.."""
This function was added back to BeautifulSoup 4 (in 4.0.4) for compatibility with BS3.
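As a quick check of how encode_contents and decode_contents relate, here is a sketch assuming bs4 with html.parser. The markup and variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with a non-ASCII character so the encoding is visible
soup = BeautifulSoup("<div><p>caf\u00e9</p></div>", "html.parser")
div = soup.div

as_str = div.decode_contents()    # Python str
as_bytes = div.encode_contents()  # bytes, UTF-8 by default

# The encoding parameter from the signature quoted above selects another encoding
as_latin1 = div.encode_contents(encoding="latin-1")

print(as_str)
print(as_bytes)
print(as_latin1)
```

Decoding the default bytes with UTF-8 should give back the same string that decode_contents returns.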
Upvotes: 106
Reputation: 4126
How about just unicode(x) (or str(x) on Python 3)? Seems to work for me.
Edit: This will give you the outer HTML and not the inner.
Upvotes: 1
Reputation: 430
One option is to use something like this:
innerhtml = "".join([str(x) for x in div_element.contents])
Upvotes: 17