Adam Williams

Reputation: 1886

Get actual content within element with BeautifulSoup

I have some HTML which looks like this:

<textarea>&lt;p&gt;</textarea>

If I do something like this in Python:

import bs4
doc = bs4.BeautifulSoup("<textarea>&lt;p&gt;</textarea>", "html.parser")
print(doc.select("textarea")[0].string)

The result <p> is printed. This is categorically false and incredibly misleading: the actual contents of this element do not include the < or > characters at all.

How can I get the actual content inside an element, as I'd see if I'd manually curl'd the page? Can I turn off this feature?


I've also tried this:

>>> for c in doc.select("textarea")[0].children:
...   print(c)
... 
<p>

Upvotes: 2

Views: 861

Answers (1)

alecxe

Reputation: 474041

This is the default documented behavior of the bs4 package:

If you give Beautiful Soup a document that contains HTML entities like “&lquot;”, they’ll be converted to Unicode characters. If you then convert the document to a string, the Unicode characters will be encoded as UTF-8. You won’t get the HTML entities back. By default, the only characters that are escaped upon output are bare ampersands and angle brackets. These get turned into &amp;, &lt;, and &gt;, so that Beautiful Soup doesn’t inadvertently generate invalid HTML or XML.
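The round trip described in that passage can be illustrated with the standard library alone. This is only an analogy for what the parser and the output formatter do, not bs4's actual implementation, and the variable names are made up for the example:

```python
import html

# What is literally written in the HTML source.
source_text = "&lt;p&gt;"

# Parsing decodes entities into the characters they stand for.
parsed_text = html.unescape(source_text)

# Serializing escapes bare ampersands and angle brackets again
# (quote=False leaves quotation marks alone).
reescaped_text = html.escape(parsed_text, quote=False)

print(parsed_text)
print(reescaped_text)
```

So the `<p>` seen through `.string` is the decoded text, and the escaped form only reappears when the tree is turned back into markup.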

You can, however, get the entities back as-is on output:

In [1]: import bs4

In [2]: doc = bs4.BeautifulSoup("<textarea>&lt;p&gt;</textarea>", "html.parser")

In [3]: textarea = doc.select_one("textarea")

In [4]: textarea.unwrap()
Out[4]: <textarea></textarea>

In [5]: print(doc)
&lt;p&gt;
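If modifying the tree with unwrap() is undesirable, a non-destructive alternative is to serialize just the element's contents with Tag.decode_contents(). A minimal sketch, assuming the same input as in the question:

```python
import bs4

doc = bs4.BeautifulSoup("<textarea>&lt;p&gt;</textarea>", "html.parser")
textarea = doc.select_one("textarea")

# .string gives the decoded text, with entities already converted.
decoded = str(textarea.string)

# decode_contents() renders the element's children back to markup,
# escaping bare ampersands and angle brackets on the way out.
inner_markup = textarea.decode_contents()

print(decoded)
print(inner_markup)
```

Note that this re-escapes the parsed text rather than returning a byte-for-byte copy of the source. With the default output formatter only ampersands and angle brackets are escaped, so other entities in the original markup (for example &quot; or &nbsp;) would not come back in their original form.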

Upvotes: 2
