Reputation: 952
I'm trying to extract the second of two identical 'div' elements from a soup element. When parsing through and extracting with the .find() method, it only ever returns the first one from the top. How can I tell the script to skip the first and get the next one if some condition is met? Below is the HTML I want to extract from.
<div class="a-row a-size-base a-color-secondary"><span>MPAA Rating: PG (Parental Guidance Suggested)</span></div>
</div>
</div></div>
<div class="sg-1"><div class="sg-2">
<div class="a-section a-spacing-none a-spacing-top-small">
<div class="a-row a-size-base a-color-base">
</div>
</div>
<div class="a-section a-spacing-none a-spacing-top-mini">
<div class="a-row a-size-base a-color-secondary"><span>$0.00 with a CONtv trial on Prime Video Channels</span></div>
</div>
This is the code I'm trying:
if '$' not in str(product.find('div', {'class': 'a-row a-size-base a-color-secondary'})):
    print('NOT IN')
    pass
    price = product.find('div', {'class': 'a-row a-size-base a-color-secondary'})
    print(price)
else:
    price = product.find('div', {'class': 'a-row a-size-base a-color-secondary'})
    print(price)
However, as a result it still gives me this:
NOT IN
<div class="a-row a-size-base a-color-secondary"><span>MPAA Rating: PG (Parental Guidance Suggested)</span></div>
Rather than this:
<div class="a-row a-size-base a-color-secondary"><span>$0.00 with a CONtv trial on Prime Video Channels</span></div>
Any suggestions?
Upvotes: 1
Views: 334
Reputation: 195438
Assuming the divs are directly under the <body>, you can use standard Python indexing. In your real code, replace body in the selector with the appropriate element:
data = '''<div class="a-row a-size-base a-color-secondary"><span>MPAA Rating: PG (Parental Guidance Suggested)</span></div>
</div>
</div></div>
<div class="sg-1"><div class="sg-2">
<div class="a-section a-spacing-none a-spacing-top-small">
<div class="a-row a-size-base a-color-base">
</div>
</div>
<div class="a-section a-spacing-none a-spacing-top-mini">
<div class="a-row a-size-base a-color-secondary"><span>$0.00 with a CONtv trial on Prime Video Channels</span></div>
</div>'''
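from bs4 import BeautifulSoup

# The lxml parser wraps the fragment in <html><body>, so the top-level divs become children of <body>
soup = BeautifulSoup(data, 'lxml')
# Index 1 is the second top-level <div> (class "sg-1"), which contains the price row
print(soup.select('body > div')[1].text.strip())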
Prints:
$0.00 with a CONtv trial on Prime Video Channels
Note the > sign in select(). It means we want all <div> directly under the <body>.
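To make the difference concrete, here is a minimal sketch comparing the child combinator (>) with the plain descendant selector. The markup and the id values are made up for illustration and are not taken from the question; html.parser is used only so the snippet needs nothing beyond bs4.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one nested div and one sibling div under <body>
doc = '<body><div id="outer"><div id="inner">x</div></div><div id="second">y</div></body>'
soup = BeautifulSoup(doc, 'html.parser')

# "body > div" matches only divs that are direct children of <body>
direct = [d['id'] for d in soup.select('body > div')]
# "body div" matches every div anywhere below <body>
nested = [d['id'] for d in soup.select('body div')]

print(direct)  # ['outer', 'second']
print(nested)  # ['outer', 'inner', 'second']
```

With the child combinator, index [1] is predictable because nested divs are not counted.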
Upvotes: 1
Reputation: 84465
You need find_all and then index into the returned list, as find only ever returns the first match. You can do the same thing with select. With bs4 4.7.1 you can use :contains to target the innerText of an element by a substring (e.g. CONtv trial) and then use select_one if the first match is wanted, or select for multiple matches. You want to test for None first before attempting to access .text
from bs4 import BeautifulSoup as bs
html = '''
<div class="a-row a-size-base a-color-secondary"><span>MPAA Rating: PG (Parental Guidance Suggested)</span></div>
</div>
</div></div>
<div class="sg-1"><div class="sg-2">
<div class="a-section a-spacing-none a-spacing-top-small">
<div class="a-row a-size-base a-color-base">
</div>
</div>
<div class="a-section a-spacing-none a-spacing-top-mini">
<div class="a-row a-size-base a-color-secondary"><span>$0.00 with a CONtv trial on Prime Video Channels</span></div>
</div>
'''
soup = bs(html, 'lxml')
print(soup.find_all('div', {'class': 'a-row a-size-base a-color-secondary'})[1].text)
print(soup.select('.a-color-secondary')[1].text)
print(soup.select_one('.a-color-secondary:contains("CONtv trial")').text)
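The None check mentioned above could look like the sketch below. It assumes a recent soupsieve (2.1 or later), where :-soup-contains is the current spelling and plain :contains is deprecated; text_or_default is a hypothetical helper name, and the HTML is a trimmed version of the question's snippet.

```python
from bs4 import BeautifulSoup

html = '''
<div class="a-row a-size-base a-color-secondary"><span>MPAA Rating: PG (Parental Guidance Suggested)</span></div>
<div class="a-row a-size-base a-color-secondary"><span>$0.00 with a CONtv trial on Prime Video Channels</span></div>
'''
soup = BeautifulSoup(html, 'html.parser')

def text_or_default(node, default=''):
    # select_one returns None when nothing matches, so guard before reading text
    return default if node is None else node.get_text(strip=True)

# Assumption: soupsieve >= 2.1; on older versions use :contains instead
hit = soup.select_one('.a-color-secondary:-soup-contains("CONtv trial")')
miss = soup.select_one('.a-color-secondary:-soup-contains("no such text")')

print(text_or_default(hit))                # $0.00 with a CONtv trial on Prime Video Channels
print(text_or_default(miss, 'not found'))  # not found
```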
Looping with find_all:
matches = soup.find_all('div', {'class': 'a-row a-size-base a-color-secondary'})
for item in matches:
    if '$' in str(item):
        print(item.text)
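If only the first row containing a price is needed, the loop can be wrapped in a small function that stops at the first hit. This is a sketch under assumptions: find_price is a hypothetical name, product stands for whatever soup element is searched in the real code, and the HTML is a trimmed version of the question's snippet.

```python
from bs4 import BeautifulSoup

html = '''
<div class="a-row a-size-base a-color-secondary"><span>MPAA Rating: PG (Parental Guidance Suggested)</span></div>
<div class="a-section a-spacing-none a-spacing-top-mini">
<div class="a-row a-size-base a-color-secondary"><span>$0.00 with a CONtv trial on Prime Video Channels</span></div>
</div>
'''

def find_price(product):
    """Return the text of the first matching div that contains '$', or None."""
    rows = product.find_all('div', {'class': 'a-row a-size-base a-color-secondary'})
    # next() with a default of None avoids an exception when no row has a price
    return next((row.get_text(strip=True) for row in rows if '$' in row.get_text()), None)

soup = BeautifulSoup(html, 'html.parser')
price = find_price(soup)
print(price if price is not None else 'no price found')
# $0.00 with a CONtv trial on Prime Video Channels
```

Unlike indexing with [1], this does not depend on the price row always being the second match.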
Upvotes: 1