Reputation: 2108
I need to find an HTML tag by part of its text. I found some solutions, but they don't work well for me.
from bs4 import BeautifulSoup
import re
soup = BeautifulSoup("""
<table>
<tbody>
<tr>
<td style="width: 100px; height: 20px">
<div style="font-size: 8.7pt">
Арт.:
<span id="ContentPlaceHolder1_ContentPlaceHolder1_DataList2_Label12_0"> 1185A</span>
</div>
<div style="font-size: 12pt; font-weight: bold;">
<span id="ContentPlaceHolder1_ContentPlaceHolder1_DataList2_LoginView3_0_Label12_0">I_CAN_GET_THIS other text</span>
I CAN NOT GET THIS?.
</div>
</td>
</tr>
</tbody>
</table>
""", 'lxml')
print(soup.find('span', text=re.compile('I_CAN_GET_THIS')))
print(soup.find('div', text=re.compile('I CAN NOT GET THIS')))
>>> <span id="ContentPlaceHolder1_ContentPlaceHolder1_DataList2_LoginView3_0_Label12_0">I_CAN_GET_THIS other text</span>
>>> None
I can't understand why it doesn't work in the second case. What should I do to make it work? Thanks.
Upvotes: 2
Views: 691
Reputation: 474191
The text argument (which is now renamed to string but is still supported) uses the .string attribute of an element, which becomes None if there is more than one child:
If a tag contains more than one thing, then it’s not clear what .string should refer to, so .string is defined to be None
This is exactly the case with your target div element: it has a span child and a text node.
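To see this directly, here is a minimal sketch that inspects .string on both elements. It assumes a trimmed-down copy of the markup from the question and uses the built-in 'html.parser' instead of 'lxml' so that it runs without extra dependencies; the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

# Trimmed-down version of the div from the question (assumption: attributes dropped)
html = '<div><span>I_CAN_GET_THIS other text</span> I CAN NOT GET THIS?.</div>'
soup = BeautifulSoup(html, 'html.parser')

span = soup.find('span')
div = soup.find('div')

span_string = span.string         # one child (a text node), so .string is that text
div_string = div.string           # two children (span + text node), so .string is None
div_children = len(div.contents)  # number of direct children of the div

print(repr(span_string))
print(repr(div_string))
print(div_children)
```

Since the text filter is compared against .string, a div whose .string is None can never match it, whatever the pattern is.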
Instead, you can locate the text node and then get its parent:
soup.find(text=re.compile('I CAN NOT GET THIS')).parent
Or use a search function that checks .get_text(), which combines the texts of all children:
soup.find(lambda tag: tag.name == 'div' and 'I CAN NOT GET THIS' in tag.get_text())
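Both approaches can be tried side by side. The sketch below assumes a simplified copy of the question's markup with hypothetical id attributes ("first", "second") added only so the result is easy to check, and it uses 'html.parser' rather than 'lxml'. The helper name div_with_text is also just for illustration.

```python
import re
from bs4 import BeautifulSoup

# Simplified markup; the ids are hypothetical and not part of the original page
html = """
<div id="first">Арт.: <span> 1185A</span></div>
<div id="second"><span>I_CAN_GET_THIS other text</span>
I CAN NOT GET THIS?.
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

# Approach 1: find the text node itself, then step up to its parent tag
text_node = soup.find(string=re.compile('I CAN NOT GET THIS'))
via_parent = text_node.parent

# Approach 2: filter function that checks the combined text of each div
def div_with_text(tag):
    return tag.name == 'div' and 'I CAN NOT GET THIS' in tag.get_text()

via_get_text = soup.find(div_with_text)

print(via_parent['id'])
print(via_get_text['id'])
```

One difference to keep in mind: .get_text() includes the text of all descendants, so if divs are nested, the second approach will likely return the outermost matching div, while the first returns the tag that directly contains the text node.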
Upvotes: 3