user2293224

Reputation: 2220

Python BeautifulSoup: Retrieving text from div tag

I am new to web scraping. I am using Beautiful Soup to extract data from the Google Play Store, but I am stuck retrieving the text from a div tag. The div looks like this:

a = <div class="LVQB0b"><div class="QoPmEb"></div><div><span class="X43Kjb">Education.com</span><span class="p2TkOb">August 15, 2019</span></div>Thanks for your feedback. We are sorry to hear you're having trouble with the app. This is a known issue and our team has fixed it. Please restart the app and let us know at [email protected] if you have any further trouble. Thanks!</div>

I want to retrieve the text starting from "Thanks for your feedback". I used the following code to retrieve the text:

response = a.find('div',{'class':'LVQB0b'}).get_text()

However, this also returns unwanted text, namely 'Education.com' and the date. I am not sure how to retrieve only the text that sits directly inside the div and is not wrapped in a tag with a class name, as in the example above. Any guidance is appreciated.

Upvotes: 1

Views: 307

Answers (3)

KunduK

Reputation: 33384

As an alternative, you can use next_sibling or find_next_sibling(text=True):

from bs4 import BeautifulSoup

html= '''<div class="LVQB0b"><div class="QoPmEb"></div><div><span class="X43Kjb">Education.com</span><span class="p2TkOb">August 15, 2019</span></div>Thanks for your feedback. We are sorry to hear you're having trouble with the app. This is a known issue and our team has fixed it. Please restart the app and let us know at [email protected] if you have any further trouble. Thanks!</div>'''
soup = BeautifulSoup(html, 'html.parser')
print(soup.find('div',class_='QoPmEb').find_next('div').next_sibling)

from bs4 import BeautifulSoup

html= '''<div class="LVQB0b"><div class="QoPmEb"></div><div><span class="X43Kjb">Education.com</span><span class="p2TkOb">August 15, 2019</span></div>Thanks for your feedback. We are sorry to hear you're having trouble with the app. This is a known issue and our team has fixed it. Please restart the app and let us know at [email protected] if you have any further trouble. Thanks!</div>'''
soup = BeautifulSoup(html, 'html.parser')
print(soup.find('div',class_='QoPmEb').find_next('div').find_next_sibling(text=True))
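
One caveat worth sketching: next_sibling returns whatever node comes next, which may be a tag or None rather than text. A minimal guard could look like the following, assuming markup shaped like the question's. The helper name reply_after_header and the shortened HTML strings are hypothetical and only for illustration; they are not part of Beautiful Soup or the original page.

```python
from bs4 import BeautifulSoup, NavigableString


def reply_after_header(html):
    """Return the text node that directly follows the name/date div, or None."""
    soup = BeautifulSoup(html, 'html.parser')
    name_span = soup.find('span', class_='X43Kjb')
    if name_span is None:
        return None
    # The span's parent is the unnamed div holding the name and date.
    node = name_span.parent.next_sibling
    # Only accept a plain text node; a Tag or None means there is no reply text here.
    if isinstance(node, NavigableString):
        return str(node).strip()
    return None


with_reply = ('<div class="LVQB0b"><div class="QoPmEb"></div>'
              '<div><span class="X43Kjb">Education.com</span>'
              '<span class="p2TkOb">August 15, 2019</span></div>'
              'Thanks for your feedback.</div>')
without_reply = ('<div class="LVQB0b"><div class="QoPmEb"></div>'
                 '<div><span class="X43Kjb">Education.com</span></div></div>')

print(reply_after_header(with_reply))
print(reply_after_header(without_reply))
```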

Upvotes: 1

Guy

Reputation: 50809

The unwanted text is part of the <div class="LVQB0b"> element. You can locate the elements that contain it and remove their text from the result:

response = a.find('div',{'class':'LVQB0b'}).get_text()
unwanted = a.select('.LVQB0b span')
for el in unwanted:
    response = response.replace(el.get_text(), '')
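
Note that str.replace removes every occurrence of the span text, so a reply that happens to mention 'Education.com' would lose that word as well. A possible variation, sketched here as an assumption rather than as part of the answer above, is to drop the child tags from a copy of the div and read what is left. The HTML string is a made-up, shortened sample.

```python
import copy
from bs4 import BeautifulSoup

# Made-up sample: the reply text deliberately repeats the developer name.
html = ('<div class="LVQB0b"><div class="QoPmEb"></div>'
        '<div><span class="X43Kjb">Education.com</span>'
        '<span class="p2TkOb">August 15, 2019</span></div>'
        'Thanks for your feedback. Contact Education.com support.</div>')

soup = BeautifulSoup(html, 'html.parser')

# Work on a copy so the original tree keeps its name and date spans.
outer = copy.copy(soup.find('div', class_='LVQB0b'))

# Remove every direct child tag; only the bare text node remains.
for child in outer.find_all(True, recursive=False):
    child.decompose()

reply = outer.get_text(strip=True)
print(reply)
```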

Upvotes: 2

Rakesh

Reputation: 82755

Use find(text=True, recursive=False), which returns the first text node that is a direct child of the div.

Example:

from bs4 import BeautifulSoup

s = '''<div class="LVQB0b"><div class="QoPmEb"></div><div><span class="X43Kjb">Education.com</span><span class="p2TkOb">August 15, 2019</span></div>Thanks for your feedback. We are sorry to hear you're having trouble with the app. This is a known issue and our team has fixed it. Please restart the app and let us know at [email protected] if you have any further trouble. Thanks!</div>'''    
html = BeautifulSoup(s, 'html.parser')
print(html.find('div',{'class':'LVQB0b'}).find(text=True, recursive=False))

Output:

Thanks for your feedback. We are sorry to hear you're having trouble with the app. This is a known issue and our team has fixed it. Please restart the app and let us know at [email protected] if you have any further trouble. Thanks!
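
If the reply were split into several direct text nodes (for example by a <br/> tag), find would return only the first one. A rough sketch for that case, assuming a recent Beautiful Soup where the string argument is available as the newer name for text, is to collect all direct text nodes with find_all. The HTML below is an invented sample, not the real page.

```python
from bs4 import BeautifulSoup

# Invented sample: the reply is broken into two text nodes by a <br/> tag.
s = ('<div class="LVQB0b"><div class="QoPmEb"></div>'
     '<div><span class="X43Kjb">Education.com</span></div>'
     'Thanks for your feedback.<br/>Please restart the app.</div>')

html = BeautifulSoup(s, 'html.parser')
outer = html.find('div', {'class': 'LVQB0b'})

# recursive=False limits the search to direct children,
# so the text inside the nested span is skipped.
parts = [str(t).strip() for t in outer.find_all(string=True, recursive=False)]
reply = ' '.join(p for p in parts if p)
print(reply)
```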

Upvotes: 4
