Pro Girl

Reputation: 952

Extracting with .find() the second of 2 identical 'div' from html page with BS4

I'm trying to extract the second of 2 identical 'div' elements from a soup element. When I parse through it and extract with the .find() method, I only ever get the first one from the top. How can I tell the script to skip the first and get the next one if some condition is met? Below is the HTML I want to extract from.

<div class="a-row a-size-base a-color-secondary"><span>MPAA Rating: PG (Parental Guidance Suggested)</span></div>
</div>
</div></div>
<div class="sg-1"><div class="sg-2">
<div class="a-section a-spacing-none a-spacing-top-small">
<div class="a-row a-size-base a-color-base">
</div>
</div>
<div class="a-section a-spacing-none a-spacing-top-mini">
<div class="a-row a-size-base a-color-secondary"><span>$0.00 with a CONtv trial on Prime Video Channels</span></div>
</div>

This is the code I'm trying:

if '$' not in str(product.find('div', {'class': 'a-row a-size-base a-color-secondary'})):
    print('NOT IN')
    pass
    price = product.find('div', {'class': 'a-row a-size-base a-color-secondary'})
    print(price)
else:
    price = product.find('div', {'class': 'a-row a-size-base a-color-secondary'})
    print(price)

However, the result is still this:

NOT IN
<div class="a-row a-size-base a-color-secondary"><span>MPAA Rating: PG (Parental Guidance Suggested)</span></div>

Rather than this:

<div class="a-row a-size-base a-color-secondary"><span>$0.00 with a CONtv trial on Prime Video Channels</span></div> 

Any suggestions?

Upvotes: 1

Views: 334

Answers (2)

Andrej Kesely

Reputation: 195438

Assuming the divs are directly under the <body>, you can use standard Python indexing. In your real code, replace body in the selector with the appropriate element:

data = '''<div class="a-row a-size-base a-color-secondary"><span>MPAA Rating: PG (Parental Guidance Suggested)</span></div>
</div>
</div></div>
<div class="sg-1"><div class="sg-2">
<div class="a-section a-spacing-none a-spacing-top-small">
<div class="a-row a-size-base a-color-base">
</div>
</div>
<div class="a-section a-spacing-none a-spacing-top-mini">
<div class="a-row a-size-base a-color-secondary"><span>$0.00 with a CONtv trial on Prime Video Channels</span></div>
</div>'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(data, 'lxml')

print(soup.select('body > div')[1].text.strip())

Prints:

$0.00 with a CONtv trial on Prime Video Channels

Note the > sign in select(). It means we want only the <div> elements that are directly under the <body>.
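To make the difference between the child combinator (>) and a plain descendant selector concrete, here is a minimal sketch. The markup and ids are made up for illustration, and html.parser is used only so the snippet does not depend on lxml; because html.parser does not add a <body> on its own, the sample includes one explicitly.

```python
from bs4 import BeautifulSoup

# Illustrative markup (not from the question): one nested div, two top-level divs.
html = """<html><body>
<div id="a"><div id="a-child"></div></div>
<div id="b"></div>
</body></html>"""

soup = BeautifulSoup(html, "html.parser")

# "body > div" matches only divs whose parent is <body>.
direct = [d["id"] for d in soup.select("body > div")]

# "body div" matches every div anywhere inside <body>.
nested = [d["id"] for d in soup.select("body div")]

print(direct)
print(nested)
```

With the child combinator the nested div is skipped, which is why indexing with [1] lands on the second top-level div rather than on something inside the first.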

Upvotes: 1

QHarr

Reputation: 84465

You need find_all and then index into the returned list, because find only ever returns the first match. You can do the same thing with select. With bs4 4.7.1+, you can use :contains to target an element's inner text by a substring (e.g. CONtv trial), then use select_one if you want the first match or select if you want all matches (newer Soup Sieve releases spell this :-soup-contains and deprecate :contains). Test for None before attempting to access .text.

from bs4 import BeautifulSoup as bs

html = '''
<div class="a-row a-size-base a-color-secondary"><span>MPAA Rating: PG (Parental Guidance Suggested)</span></div>
</div>
</div></div>
<div class="sg-1"><div class="sg-2">
<div class="a-section a-spacing-none a-spacing-top-small">
<div class="a-row a-size-base a-color-base">
</div>
</div>
<div class="a-section a-spacing-none a-spacing-top-mini">
<div class="a-row a-size-base a-color-secondary"><span>$0.00 with a CONtv trial on Prime Video Channels</span></div>
</div>
'''
soup = bs(html, 'lxml')
print(soup.find_all('div', {'class': 'a-row a-size-base a-color-secondary'})[1].text)
print(soup.select('.a-color-secondary')[1].text)
print(soup.select_one('.a-color-secondary:contains("CONtv trial")').text)
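The None check mentioned above could look roughly like the sketch below. The helper name text_or_none and the class a-color-price are hypothetical, chosen only to show a lookup that finds nothing; html.parser is used here to avoid the lxml dependency.

```python
from bs4 import BeautifulSoup

# Illustrative snippet: only the rating row is present, no price row.
html = '<div class="a-row a-size-base a-color-secondary"><span>MPAA Rating: PG</span></div>'
soup = BeautifulSoup(html, "html.parser")


def text_or_none(node):
    """Return the stripped text of a tag, or None if the lookup found nothing."""
    return node.get_text(strip=True) if node is not None else None


# select_one returns None when nothing matches, so guard before reading text.
rating = text_or_none(soup.select_one(".a-color-secondary"))
price = text_or_none(soup.select_one(".a-color-price"))  # hypothetical class, absent here

print(rating)
print(price)
```

Without the guard, the second lookup would raise AttributeError on .text, because select_one (like find) returns None when there is no match.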

Looping with find_all:

matches = soup.find_all('div', {'class': 'a-row a-size-base a-color-secondary'})
for item in matches:
    if '$' in str(item):
        print(item.text)
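If only the first row containing a price is wanted, the loop can be folded into a small helper that also returns None when nothing matches. This is a sketch rather than part of the original answer: the name first_price_text is hypothetical, the HTML is a trimmed copy of the question's markup, and html.parser is swapped in for lxml to keep the snippet dependency-free.

```python
from bs4 import BeautifulSoup

# Trimmed version of the markup from the question.
html = """
<div class="a-row a-size-base a-color-secondary"><span>MPAA Rating: PG (Parental Guidance Suggested)</span></div>
<div class="a-section a-spacing-none a-spacing-top-mini">
<div class="a-row a-size-base a-color-secondary"><span>$0.00 with a CONtv trial on Prime Video Channels</span></div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")


def first_price_text(product):
    """Return the text of the first matching row that contains '$', else None."""
    # class_ with a single class name matches tags that carry that class
    # among others, so the full class string does not have to be repeated.
    rows = product.find_all("div", class_="a-color-secondary")
    return next(
        (row.get_text(strip=True) for row in rows if "$" in row.get_text()),
        None,
    )


price = first_price_text(soup)
print(price)
```

Because the helper filters on content instead of position, it keeps working if the rating row is missing or if extra rows appear before the price.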

Upvotes: 1
