Get actual content within element with BeautifulSoup

Question

I have some HTML which looks like this:

<textarea>&lt;p&gt;</textarea>

If I do something like this in Python:

import bs4
doc = bs4.BeautifulSoup("<textarea>&lt;p&gt;</textarea>", "html.parser")
print(doc.select("textarea")[0].string)

The result

<p>

is printed. This is categorically false and incredibly misleading: the actual contents of this element do not include the < or > characters at all.

How can I get the actual content inside an element, as I'd see if I'd manually curl'd the page? Can I turn off this feature?

I've also tried this:

>>> for c in doc.select("textarea")[0].children:
...   print(c)
...

alecxe · Accepted Answer

This is the default documented behavior of the bs4 package:

If you give Beautiful Soup a document that contains HTML entities like “&lquot;”, they’ll be converted to Unicode characters. If you then convert the document to a string, the Unicode characters will be encoded as UTF-8. You won’t get the HTML entities back. By default, the only characters that are escaped upon output are bare ampersands and angle brackets. These get turned into &amp;, &lt;, and &gt;, so that Beautiful Soup doesn’t inadvertently generate invalid HTML or XML.
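To make the quoted behavior concrete, here is a minimal sketch of the round trip. The sample markup is made up for illustration and is not from the question: a named entity is decoded to a Unicode character on parsing and is not restored on output, while a bare ampersand is escaped again.

```python
import bs4

# Hypothetical sample markup, chosen only to illustrate the quoted passage.
html = "<p>caf&eacute; &amp; bar</p>"
doc = bs4.BeautifulSoup(html, "html.parser")
p = doc.select_one("p")

# Parsing decodes every entity into the character it stands for.
parsed_text = p.string
print(parsed_text)   # café & bar

# Serializing escapes only &, < and >; &eacute; does not come back.
serialized = str(p)
print(serialized)    # <p>café &amp; bar</p>
```

So .string gives the decoded text, and only str() (or print) of a tag or document goes through the output escaping.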

You can, though, get the entities back as they were on output:

In [1]: import bs4

In [2]: doc = bs4.BeautifulSoup("<textarea>&lt;p&gt;</textarea>", "html.parser")

In [3]: textarea = doc.select_one("textarea")

In [4]: textarea.unwrap()
Out[4]: <textarea></textarea>

In [5]: print(doc)
&lt;p&gt;
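If modifying the tree with unwrap() is not desirable, a possible alternative is Tag.decode_contents(), which serializes only the children of a tag using the same default output escaping. This is a sketch, assuming the markup in the question was a textarea containing the escaped text; it is not part of the original answer.

```python
import bs4

# Assumed markup: a textarea whose content is the escaped text &lt;p&gt;.
html = "<textarea>&lt;p&gt;</textarea>"
doc = bs4.BeautifulSoup(html, "html.parser")
textarea = doc.select_one("textarea")

# Serialize the tag's children only; angle brackets are escaped on output.
inner_markup = textarea.decode_contents()
print(inner_markup)      # &lt;p&gt;

# The decoded text is still available, and the tree is left untouched.
print(textarea.string)   # <p>
```

Note that this re-escapes the parsed text rather than returning the original bytes, so it matches the source only where the source used the same escaping.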
