Reputation: 1886
I have some HTML which looks like this:
<textarea>&lt;p&gt;</textarea>
If I do something like this in Python:
import bs4
doc = bs4.BeautifulSoup("<textarea>&lt;p&gt;</textarea>", "html.parser")
print(doc.select("textarea")[0].string)
The result <p> is printed. This is categorically false and incredibly misleading: the actual contents of this element do not include the < or > characters at all.
How can I get the actual content inside an element, as I'd see if I'd manually curl'd the page? Can I turn off this feature?
I've also tried this:
>>> for c in doc.select("textarea")[0].children:
...     print(c)
...
<p>
Upvotes: 2
Views: 861
Reputation: 474041
This is the default, documented behavior of the bs4 package:
If you give Beautiful Soup a document that contains HTML entities like “&lquot;”, they’ll be converted to Unicode characters. If you then convert the document to a string, the Unicode characters will be encoded as UTF-8. You won’t get the HTML entities back. By default, the only characters that are escaped upon output are bare ampersands and angle brackets. These get turned into “&amp;”, “&lt;”, and “&gt;”, so that Beautiful Soup doesn’t inadvertently generate invalid HTML or XML.
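To illustrate: the entities are decoded when the document is parsed, and the default "minimal" formatter escapes bare ampersands and angle brackets again whenever the tree is serialized. A minimal sketch of that round trip (decode_contents() is just one way to serialize a tag's contents):

import bs4

doc = bs4.BeautifulSoup("<textarea>&lt;p&gt;</textarea>", "html.parser")
textarea = doc.select_one("textarea")

# The parsed contents hold the decoded Unicode characters
print(textarea.string)             # <p>

# Serialization re-escapes bare ampersands and angle brackets,
# so the entities reappear in the output markup
print(textarea.decode_contents())  # &lt;p&gt;
print(str(textarea))               # <textarea>&lt;p&gt;</textarea>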
You can, though, get your entities back as-is on output:
In [1]: import bs4
In [2]: doc = bs4.BeautifulSoup("<textarea>&lt;p&gt;</textarea>", "html.parser")
In [3]: textarea = doc.select_one("textarea")
In [4]: textarea.unwrap()
Out[4]: <textarea></textarea>
In [5]: print(doc)
&lt;p&gt;
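If the goal is instead to turn the escaping off on output entirely, the serialization methods (decode(), prettify()) accept a formatter argument; passing formatter=None leaves strings untouched, at the risk of producing invalid markup. A minimal sketch, assuming the same input document:

import bs4

doc = bs4.BeautifulSoup("<textarea>&lt;p&gt;</textarea>", "html.parser")

# formatter=None disables output escaping, so the decoded characters
# are emitted as-is (this can generate invalid HTML)
print(doc.decode(formatter=None))  # <textarea><p></textarea>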
Upvotes: 2