lclankyo

Reputation: 259

Python BeautifulSoup find element that contains text

<div class="info">
       <h3> Height:
            <span>1.1</span>
       </h3>
</div>

<div class="info">
       <h3> Number:
            <span>111111111</span>
       </h3>
</div>

This is part of the page. Ultimately, I want to extract the 111111111. I know I can call soup.find_all("div", { "class" : "info" }) to get a list of both divs; however, I would prefer not to loop over them to check whether each one contains the text "Number".

Is there a more elegant way to extract "111111111", so that the search still uses soup.find_all("div", { "class" : "info" }) but only matches a div that contains "Number"?

I also tried numberSoup = soup.find('h3', text='Number'), but it returns None.

Upvotes: 4

Views: 10347

Answers (2)

dokelung

Reputation: 216

You can write your own filter function and pass it as the argument to find_all.

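from bs4 import BeautifulSoup

def number_span(tag):
    # Match a <span> whose parent's first child is text containing 'Number:'
    return tag.name == 'span' and 'Number:' in tag.parent.contents[0]

# html is the page source shown in the question
soup = BeautifulSoup(html, 'html.parser')
tags = soup.find_all(number_span)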
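The same idea can be applied to the div itself, which is closer to what the question asks for. The following is only a sketch: the function name info_div_with_number and the inline html string are assumptions made for illustration, and it relies on html.parser returning the class attribute as a list.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question (assumed to be representative)
html = """
<div class="info">
    <h3> Height:
        <span>1.1</span>
    </h3>
</div>
<div class="info">
    <h3> Number:
        <span>111111111</span>
    </h3>
</div>
"""

def info_div_with_number(tag):
    # Hypothetical filter: a <div class="info"> whose text mentions "Number"
    return (
        tag.name == "div"
        and "info" in tag.get("class", [])
        and "Number" in tag.get_text()
    )

soup = BeautifulSoup(html, "html.parser")
matches = soup.find_all(info_div_with_number)

# Pull the value out of the <span> inside the first matching div
number = matches[0].find("span").get_text(strip=True)
print(number)
```

The loop still happens, but inside find_all rather than in your own code.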

By the way, the reason you can't fetch the tag with the text parameter is that text matches tags whose .string equals the given value. If a tag contains more than one child, it is not clear what .string should refer to, so .string is defined to be None. Here each <h3> contains both a text node and a <span>, so its .string is None.

See the Beautiful Soup documentation for details.
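A small sketch of that behaviour, using the markup from the question. It also shows one possible workaround, which is an assumption on my part rather than part of the answer above: search the text nodes themselves with string= and a function, then step up to the parent tag.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question
html = """
<div class="info">
    <h3> Height:
        <span>1.1</span>
    </h3>
</div>
<div class="info">
    <h3> Number:
        <span>111111111</span>
    </h3>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Each <h3> holds a text node plus a <span>, so .string is None for both
strings = [h3.string for h3 in soup.find_all("h3")]
print(strings)

# Search the text nodes directly; this returns the matching string, not a tag
label = soup.find(string=lambda s: s and "Number:" in s)

# label.parent is the enclosing <h3>; the value sits in its <span>
number = label.parent.find("span").get_text(strip=True)
print(number)
```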

Upvotes: 7

JRazor

Reputation: 2817

Use XPath contains():

root.xpath('//div/h3[contains(text(), "Number")]/span/text()')
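The answer does not say where root comes from. A minimal sketch, assuming root is an lxml tree built from the page source (lxml is a separate library, not part of BeautifulSoup, and the page string below is just the markup from the question wrapped in html and body tags):

```python
from lxml import html as lxml_html

# Markup from the question, wrapped so it parses as a full document
page = """
<html><body>
<div class="info">
    <h3> Height:
        <span>1.1</span>
    </h3>
</div>
<div class="info">
    <h3> Number:
        <span>111111111</span>
    </h3>
</div>
</body></html>
"""

# Parse the page; root is the <html> element
root = lxml_html.fromstring(page)

# contains() tests the first text node of each <h3>, then the path
# descends into the <span> and returns its text
result = root.xpath('//div/h3[contains(text(), "Number")]/span/text()')
print(result)
```

xpath returns a list of matching strings, so take result[0] for the single value.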

Upvotes: 3
