Reputation: 2220
I am new to web scraping. I am using Beautiful Soup to extract data from the Google Play Store, but I am stuck retrieving the text from a div tag. The div looks like this:
a = `<div class="LVQB0b"><div class="QoPmEb"></div><div><span class="X43Kjb">Education.com</span><span class="p2TkOb">August 15, 2019</span></div>Thanks for your feedback. We are sorry to hear you're having trouble with the app. This is a known issue and our team has fixed it. Please restart the app and let us know at [email protected] if you have any further trouble. Thanks!</div>`
I want to retrieve the text starting from "Thanks for your feedback". I used the following code to retrieve the text:
response = a.find('div',{'class':'LVQB0b'}).get_text()
However, the above command also returns unwanted text, i.e. 'Education.com' and the date. I am not sure how to retrieve only the text that sits directly inside the div and has no class name of its own, as in the example above. Any guidance would be appreciated.
Upvotes: 1
Views: 307
Reputation: 33384
As an alternative, you can use next_sibling or find_next_sibling(text=True):
# Variant 1: next_sibling
from bs4 import BeautifulSoup
html= '''<div class="LVQB0b"><div class="QoPmEb"></div><div><span class="X43Kjb">Education.com</span><span class="p2TkOb">August 15, 2019</span></div>Thanks for your feedback. We are sorry to hear you're having trouble with the app. This is a known issue and our team has fixed it. Please restart the app and let us know at [email protected] if you have any further trouble. Thanks!</div>'''
soup = BeautifulSoup(html, 'html.parser')
print(soup.find('div',class_='QoPmEb').find_next('div').next_sibling)
# Variant 2: find_next_sibling(text=True)
from bs4 import BeautifulSoup
html= '''<div class="LVQB0b"><div class="QoPmEb"></div><div><span class="X43Kjb">Education.com</span><span class="p2TkOb">August 15, 2019</span></div>Thanks for your feedback. We are sorry to hear you're having trouble with the app. This is a known issue and our team has fixed it. Please restart the app and let us know at [email protected] if you have any further trouble. Thanks!</div>'''
soup = BeautifulSoup(html, 'html.parser')
print(soup.find('div',class_='QoPmEb').find_next('div').find_next_sibling(text=True))
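Both variants raise an AttributeError if the div with class QoPmEb is missing, because find() returns None in that case. Here is a minimal sketch of the same sibling lookup with guards added, assuming the markup from the question. The helper name reply_text and the shortened sample HTML are hypothetical, made up for illustration.

```python
from bs4 import BeautifulSoup, NavigableString

# Shortened, hypothetical sample with the same structure as the question's div.
html = ('<div class="LVQB0b"><div class="QoPmEb"></div>'
        '<div><span class="X43Kjb">Education.com</span>'
        '<span class="p2TkOb">August 15, 2019</span></div>'
        'Thanks for your feedback.</div>')


def reply_text(soup):
    """Return the text node that follows the header div, or None if absent."""
    marker = soup.find('div', class_='QoPmEb')
    if marker is None:
        return None
    # The div holding the developer name and date is the next sibling div.
    header = marker.find_next_sibling('div')
    if header is None:
        return None
    node = header.next_sibling
    # next_sibling may be a Tag or None; only accept a plain text node.
    if isinstance(node, NavigableString):
        return str(node).strip()
    return None


soup = BeautifulSoup(html, 'html.parser')
result = reply_text(soup)
print(result)
```

Returning None instead of raising makes it easier to loop over many reviews, some of which may have no developer reply.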
Upvotes: 1
Reputation: 50809
The unwanted text is part of the <div class="LVQB0b"> element. You can locate the nested elements that hold it and remove their text from the result:
response = a.find('div',{'class':'LVQB0b'}).get_text()
unwanted = a.select('.LVQB0b span')
for el in unwanted:
    response = response.replace(el.get_text(), '')
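One caveat with string replacement: if the reply itself contains the developer name or the date, those occurrences are removed too. A possible variation, sketched here as a different technique rather than as this answer's method, is to drop the nested divs from a copy of the element and then read what is left. The sample HTML is shortened and hypothetical; copying a Tag with copy.copy() assumes Beautiful Soup 4.4.0 or later.

```python
import copy

from bs4 import BeautifulSoup

# Hypothetical sample: the reply mentions the developer name on purpose.
html = ('<div class="LVQB0b"><div class="QoPmEb"></div>'
        '<div><span class="X43Kjb">Education.com</span>'
        '<span class="p2TkOb">August 15, 2019</span></div>'
        'Thanks for your feedback. Contact Education.com support.</div>')

soup = BeautifulSoup(html, 'html.parser')

# Work on a detached copy so the original tree stays intact.
container = copy.copy(soup.find('div', class_='LVQB0b'))

# Remove only the direct child divs (header and empty placeholder).
for child in container.find_all('div', recursive=False):
    child.decompose()

response = container.get_text(strip=True)
print(response)
```

Because the child tags are removed structurally, the name inside the reply text is preserved.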
Upvotes: 2
Reputation: 82755
Use find(text=True, recursive=False), which returns the first text node that is a direct child of the div.
Example:
from bs4 import BeautifulSoup
s = '''<div class="LVQB0b"><div class="QoPmEb"></div><div><span class="X43Kjb">Education.com</span><span class="p2TkOb">August 15, 2019</span></div>Thanks for your feedback. We are sorry to hear you're having trouble with the app. This is a known issue and our team has fixed it. Please restart the app and let us know at [email protected] if you have any further trouble. Thanks!</div>'''
html = BeautifulSoup(s, 'html.parser')
print(html.find('div',{'class':'LVQB0b'}).find(text=True, recursive=False))
Output:
Thanks for your feedback. We are sorry to hear you're having trouble with the app. This is a known issue and our team has fixed it. Please restart the app and let us know at [email protected] if you have any further trouble. Thanks!
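find() returns only the first direct text node. If the div could hold several text nodes separated by child tags, a sketch along these lines would collect all of them. The helper name own_text and the sample HTML are hypothetical; string=True is the newer spelling of text=True and assumes Beautiful Soup 4.4.0 or later.

```python
from bs4 import BeautifulSoup


def own_text(tag):
    """Join the text nodes that are direct children of `tag`."""
    # recursive=False restricts the search to direct children only.
    parts = tag.find_all(string=True, recursive=False)
    return ' '.join(p.strip() for p in parts if p.strip())


# Hypothetical sample with direct text before and after a nested div.
html = ('<div class="LVQB0b">Hi there.'
        '<div><span class="X43Kjb">Education.com</span></div>'
        'Thanks for your feedback.</div>')

soup = BeautifulSoup(html, 'html.parser')
combined = own_text(soup.find('div', class_='LVQB0b'))
print(combined)
```

For the markup in the question there is only one direct text node, so this gives the same result as find().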
Upvotes: 4