Reputation: 167
soup = BeautifulSoup("<p>'hello'<a>'my link'</a></p>", 'html.parser')
print(soup.p.string)
None
Is the output normal?
Upvotes: 0
Views: 215
Reputation: 901
>>> soup = BeautifulSoup("<p>adA<a>asda</a>asda</p>")
>>> soup.p
<p>adA<a>asda</a>asda</p>
>>> soup.p.text
u'adAasdaasda'
I think BeautifulSoup can't give you only the paragraph's own text here because there is an a tag nested inside it. When you ask for the text, it recursively collects the text from all children and appends it to the output.
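As a side note, if you only want the paragraph's direct strings (skipping the nested a), one option - just a sketch, not something from the question - is find_all(string=True, recursive=False):

from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>adA<a>asda</a>asda</p>", "html.parser")

# .text / .get_text() gathers the strings of all descendants
print(soup.p.text)                                    # adAasdaasda

# only the paragraph's direct strings, nested tags excluded
print(soup.p.find_all(string=True, recursive=False))  # ['adA', 'asda']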
Upvotes: 0
Reputation: 473803
Since the initially posted <\p> was just a typo, here is what your problem is actually about.
It is about how .string works in BeautifulSoup. It behaves differently depending on the element's children - if an element has multiple children, it returns None:
If a tag contains more than one thing, then it’s not clear what .string should refer to, so .string is defined to be None
Notice how the .string for the p element is None, but for a it is not:
In [1]: from bs4 import BeautifulSoup
In [2]: soup = BeautifulSoup("<p>'hello'<a>'my link'</a></p>", 'html.parser')
In [3]: print(soup.p.string)
None
In [4]: print(soup.p.a.string)
'my link'
The correct and more reliable way to get the element's text is via .get_text():
In [5]: print(soup.p.get_text(strip=True))
'hello''my link'
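If you'd rather not have the two strings run together, get_text() also accepts a separator argument (a small extra illustration, not part of the original question):

from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>'hello'<a>'my link'</a></p>", "html.parser")

# pass a separator so the stripped strings are joined with a space
print(soup.p.get_text(" ", strip=True))  # 'hello' 'my link'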
Upvotes: 1
Reputation: 8382
Although <\p> is invalid, lxml will try to close the open tag, so this code works. html.parser does not do as good a job in that regard.
soup = BeautifulSoup("<p>'hello'<a>'my link'</a></p>", 'lxml')
print(soup.p.get_text(strip=True))
Which outputs:
'hello''my link'
Upvotes: 2