Hamza

Reputation: 167

Can't get the text between the opening and closing tag

soup = BeautifulSoup("<p>'hello'<a>'my link'</a></p>", 'html.parser')
print(soup.p.string)
None

Is the output normal?

Upvotes: 0

Views: 215

Answers (3)

Dawid Dave Kosiński

Reputation: 901

>>> soup = BeautifulSoup("<p>adA<a>asda</a>asda</p>")
>>> soup.p
<p>adA<a>asda</a>asda</p>
>>> soup.p.text
u'adAasdaasda'

I think BeautifulSoup can't return only the paragraph's own text here because there is an a tag nested inside it. When you ask for .text, it recursively collects the text from all of the children and concatenates it into the output.
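
If only the paragraph's own text is wanted, without the text of the nested a, one way (a minimal sketch, not taken from the answer above) is to keep just the direct NavigableString children of the p tag:

from bs4 import BeautifulSoup, NavigableString

soup = BeautifulSoup("<p>adA<a>asda</a>asda</p>", "html.parser")

# Join only the direct text nodes of <p>, skipping nested tags such as <a>
direct_text = "".join(
    child for child in soup.p.contents if isinstance(child, NavigableString)
)
print(direct_text)  # adAasda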

Upvotes: 0

alecxe

Reputation: 473803

Since the initially posted <\p> was just a typo, here is what your problem is actually about.

It is about how .string works in BeautifulSoup. Its behavior depends on the element's children: if an element has more than one child, it returns None:

If a tag contains more than one thing, then it’s not clear what .string should refer to, so .string is defined to be None

Notice how .string for the p element is None, while for the a element it is not:

In [1]: from bs4 import BeautifulSoup

In [2]: soup = BeautifulSoup("<p>'hello'<a>'my link'</a></p>", 'html.parser')

In [3]: print(soup.p.string)
None

In [4]: print(soup.p.a.string)
'my link'

The correct and more reliable way to get the element's text is via .get_text():

In [5]: print(soup.p.get_text(strip=True))
'hello''my link'
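
If the two runs of text should not be glued together, get_text() also accepts a separator as its first argument; a small follow-up sketch (not part of the original session above):

In [6]: print(soup.p.get_text(" ", strip=True))
'hello' 'my link'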

Upvotes: 1

Zroq

Reputation: 8382

Although <\p> is invalid, lxml will try to close the first tag, so this code works; html.parser does not do as good a job in that respect (a parser comparison sketch follows the output below).

soup = BeautifulSoup("<p>'hello'<a>'my link'</a></p>", 'lxml')
print(soup.p.get_text(strip=True))

Which outputs:

'hello''my link'
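
To see how the parsers differ on the originally posted markup, here is a minimal comparison sketch (it assumes the original typo was <\p> in place of </p>, and it requires lxml to be installed, e.g. via pip install lxml):

from bs4 import BeautifulSoup

# the markup as originally posted, with the <\p> typo instead of </p>
markup = "<p>'hello'<a>'my link'</a><\\p>"

for parser in ("html.parser", "lxml"):
    soup = BeautifulSoup(markup, parser)
    # the two parsers repair broken markup differently,
    # so soup.p (and its text) may not be the same for both
    print(parser, "->", soup.p)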

Upvotes: 2
