user3647856

Reputation: 45

Python BeautifulSoup Same name DIV, ignore first

I'm playing around with Python's bs4 and trying to work out how to skip the first DIV with a given class name so I can collect the data from the second one.

Below is an example of the HTML. I am trying to extract the part marked ##Wanted data##.

##Pointless Data###
<div class="PowerDetails">
<div class="Company">
        <p class="RunningCost">$4.44</p>
        <p class="Time">peek</p>
        <p class="RunningCost"> $2.33</p>           
        <p class="Time">Off-peek</p>
</div>
</div>

##Wanted data##
<div class="PowerDetails">
  <div class="Company">
        <p class="RunningCost">$8.88</p>
        <p class="Time">peek</p>
        <p class="RunningCost"> $9.99</p>           
        <p class="Time">Off-peek</p>
  </div>
  <div class="Company">
        <p class="RunningCost">$8.88</p>
        <p class="Time">peek</p>
        <p class="RunningCost"> $7.77</p>           
        <p class="Time">Off-peek</p>
  </div>
</div>

from bs4 import BeautifulSoup

soup = BeautifulSoup("<html_text>" , "html.parser")

div = soup.find("div")

div.find_all("div", {"class":"PowerDetails"})

PowerDetails[1].find_all("p", "class":"RunningCost")
PowerDetails[1].find_all("p", "class":"Time")

Upvotes: 0

Views: 677

Answers (3)

dabingsou

Reputation: 2469

Another method, using the simplified_scrapy library.

from simplified_scrapy import SimplifiedDoc

html = '''
##Pointless Data###
<div class="PowerDetails">
<div class="Company">
        <p class="RunningCost">$4.44</p>
        <p class="Time">peek</p>
        <p class="RunningCost"> $2.33</p>           
        <p class="Time">Off-peek</p>
</div>
</div>

##Wanted data##
<div class="PowerDetails">
  <div class="Company">
        <p class="RunningCost">$8.88</p>
        <p class="Time">peek</p>
        <p class="RunningCost"> $9.99</p>           
        <p class="Time">Off-peek</p>
  </div>
  <div class="Company">
        <p class="RunningCost">$8.88</p>
        <p class="Time">peek</p>
        <p class="RunningCost"> $7.77</p>           
        <p class="Time">Off-peek</p>
  </div>
</div>
'''

doc = SimplifiedDoc(html)

# First method, get all, use index.
PowerDetails = doc.selects('div.PowerDetails')[1].selects(
    'div.Company').selects('p')
for ps in PowerDetails:
    print([(p['class'], p.text) for p in ps])

# Second method, skip the first with parameter start
PowerDetails = doc.getElement(
    'div', value='PowerDetails',
    start='class="PowerDetails"').selects('div.Company').selects('p')
for ps in PowerDetails:
    print([(p['class'], p.text) for p in ps])

Result:

[('RunningCost', '$8.88'), ('Time', 'peek'), ('RunningCost', '$9.99'), ('Time', 'Off-peek')]
[('RunningCost', '$8.88'), ('Time', 'peek'), ('RunningCost', '$7.77'), ('Time', 'Off-peek')]
[('RunningCost', '$8.88'), ('Time', 'peek'), ('RunningCost', '$9.99'), ('Time', 'Off-peek')]
[('RunningCost', '$8.88'), ('Time', 'peek'), ('RunningCost', '$7.77'), ('Time', 'Off-peek')]

Upvotes: 0

Moosa Saadat

Reputation: 1167

You can slice the resulting list to get the elements from index 1 onwards. But first, note that your code is not finding the right tags.

from bs4 import BeautifulSoup

html_doc = """
<div class="PowerDetails">
<div class="Company">
        <p class="RunningCost">$4.44</p>
        <p class="Time">peek</p>
        <p class="RunningCost"> $2.33</p>           
        <p class="Time">Off-peek</p>
</div>
</div>

##Wanted data##
<div class="PowerDetails">
  <div class="Company">
        <p class="RunningCost">$8.88</p>
        <p class="Time">peek</p>
        <p class="RunningCost"> $9.99</p>           
        <p class="Time">Off-peek</p>
  </div>
  <div class="Company">
        <p class="RunningCost">$8.88</p>
        <p class="Time">peek</p>
        <p class="RunningCost"> $7.77</p>           
        <p class="Time">Off-peek</p>
  </div>
</div>
"""
soup = BeautifulSoup(html_doc , "html.parser")

# You can get the divs with one line of code
powerDetails = soup.find_all(class_="PowerDetails")

print(len(powerDetails)) # Outputs 2

Now you can slice the list to ignore the first div:

powerDetails = powerDetails[1:] # Get elements from 2nd element onwards (ignoring the first one)
print(len(powerDetails)) # Outputs 1

You now have a list with only one element:

print(powerDetails)

Output:

[<div class="PowerDetails">
<div class="Company">
<p class="RunningCost">$8.88</p>
<p class="Time">peek</p>
<p class="RunningCost"> $9.99</p>
<p class="Time">Off-peek</p>
</div>
<div class="Company">
<p class="RunningCost">$8.88</p>
<p class="Time">peek</p>
<p class="RunningCost"> $7.77</p>
<p class="Time">Off-peek</p>
</div>
</div>]
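
To go from the sliced list to the actual values, one possible follow-up is to loop over each Company div and pair every cost with its time label. This is only a sketch: the trimmed-down HTML and the names `rows`, `costs` and `times` are my own and not part of the original code.

```python
from bs4 import BeautifulSoup

# Trimmed-down copy of the HTML above (illustrative sample only)
html_doc = """
<div class="PowerDetails">
  <div class="Company">
    <p class="RunningCost">$4.44</p>
    <p class="Time">peek</p>
  </div>
</div>
<div class="PowerDetails">
  <div class="Company">
    <p class="RunningCost">$8.88</p>
    <p class="Time">peek</p>
    <p class="RunningCost"> $9.99</p>
    <p class="Time">Off-peek</p>
  </div>
  <div class="Company">
    <p class="RunningCost">$8.88</p>
    <p class="Time">peek</p>
    <p class="RunningCost"> $7.77</p>
    <p class="Time">Off-peek</p>
  </div>
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")

rows = []
# Skip the first PowerDetails div, then walk each Company inside the rest
for details in soup.find_all("div", class_="PowerDetails")[1:]:
    for company in details.find_all("div", class_="Company"):
        costs = [p.get_text(strip=True)
                 for p in company.find_all("p", class_="RunningCost")]
        times = [p.get_text(strip=True)
                 for p in company.find_all("p", class_="Time")]
        # Pair each time label with the cost at the same position
        rows.append(list(zip(times, costs)))

print(rows)
```

Pairing by position assumes every RunningCost has a matching Time in the same Company div, which holds for the sample but may not hold on a real page.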

Upvotes: 0

buran

Reputation: 14233

find_all() returns a list. Use slicing or indexing to access just the elements you want.
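
A minimal sketch of the difference between the two, assuming a cut-down stand-in for the question's HTML (the sample markup and variable names are illustrative only):

```python
from bs4 import BeautifulSoup

# Minimal stand-in HTML with two same-named divs (illustrative only)
html = """
<div class="PowerDetails"><p class="RunningCost">$4.44</p></div>
<div class="PowerDetails"><p class="RunningCost">$8.88</p></div>
"""
soup = BeautifulSoup(html, "html.parser")
divs = soup.find_all("div", class_="PowerDetails")

second = divs[1]   # indexing gives a single Tag
rest = divs[1:]    # slicing gives a list of Tags (one element here)

# A Tag can be searched directly; a list has to be iterated first
second_cost = second.find("p", class_="RunningCost").get_text(strip=True)
rest_costs = [d.find("p", class_="RunningCost").get_text(strip=True)
              for d in rest]
print(second_cost, rest_costs)
```

Indexing is the shorter route when exactly one div is wanted. Slicing keeps working if the page later gains more PowerDetails blocks after the first.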

Upvotes: 1
