Reputation: 45
So playing around with the python bs4 and trying to work out how to ignore the same DIV name to collect the data for the second lot.
Below is an example of the code I am try to extract ##Wanted data##
##Pointless Data###
<div class="PowerDetails">
<div class="Company">
<p class="RunningCost">$4.44</p>
<p class="Time">peek</p>
<p class="RunningCost"> $2.33</p>
<p class="Time">Off-peek</p>
</div>
</div>
##Wanted data##
<div class="PowerDetails">
<div class="Company">
<p class="RunningCost">$8.88</p>
<p class="Time">peek</p>
<p class="RunningCost"> $9.99</p>
<p class="Time">Off-peek</p>
</div>
<div class="Company">
<p class="RunningCost">$8.88</p>
<p class="Time">peek</p>
<p class="RunningCost"> $7.77</p>
<p class="Time">Off-peek</p>
</div>
</div>
from bs4 import BeautifulSoup
soup = BeautifulSoup("<html_text>" , "html.parser")
div = soup.find("div")
div.find_all("div", {"class":"PowerDetails"})
PowerDetails[1].find_all("p", "class":"RunningCost")
PowerDetails[1].find_all("p", "class":"Time")
Upvotes: 0
Views: 677
Reputation: 2469
Another method.
from simplified_scrapy import SimplifiedDoc
html = '''
##Pointless Data###
<div class="PowerDetails">
<div class="Company">
<p class="RunningCost">$4.44</p>
<p class="Time">peek</p>
<p class="RunningCost"> $2.33</p>
<p class="Time">Off-peek</p>
</div>
</div>
##Wanted data##
<div class="PowerDetails">
<div class="Company">
<p class="RunningCost">$8.88</p>
<p class="Time">peek</p>
<p class="RunningCost"> $9.99</p>
<p class="Time">Off-peek</p>
</div>
<div class="Company">
<p class="RunningCost">$8.88</p>
<p class="Time">peek</p>
<p class="RunningCost"> $7.77</p>
<p class="Time">Off-peek</p>
</div>
</div>
'''
doc = SimplifiedDoc(html)
# First method, get all, use index.
PowerDetails = doc.selects('div.PowerDetails')[1].selects(
'div.Company').selects('p')
for ps in PowerDetails:
print([(p['class'], p.text) for p in ps])
# Second method, skip the first with parameter start
PowerDetails = doc.getElement(
'div', value='PowerDetails',
start='class="PowerDetails"').selects('div.Company').selects('p')
for ps in PowerDetails:
print([(p['class'], p.text) for p in ps])
Result:
[('RunningCost', '$8.88'), ('Time', 'peek'), ('RunningCost', '$9.99'), ('Time', 'Off-peek')]
[('RunningCost', '$8.88'), ('Time', 'peek'), ('RunningCost', '$7.77'), ('Time', 'Off-peek')]
[('RunningCost', '$8.88'), ('Time', 'peek'), ('RunningCost', '$9.99'), ('Time', 'Off-peek')]
[('RunningCost', '$8.88'), ('Time', 'peek'), ('RunningCost', '$7.77'), ('Time', 'Off-peek')]
Upvotes: 0
Reputation: 1167
You can slice the resultant list to get elements from the 1st index onwards. But, first you are not finding the right tags in your code.
from bs4 import BeautifulSoup
html_doc = """
<div class="PowerDetails">
<div class="Company">
<p class="RunningCost">$4.44</p>
<p class="Time">peek</p>
<p class="RunningCost"> $2.33</p>
<p class="Time">Off-peek</p>
</div>
</div>
##Wanted data##
<div class="PowerDetails">
<div class="Company">
<p class="RunningCost">$8.88</p>
<p class="Time">peek</p>
<p class="RunningCost"> $9.99</p>
<p class="Time">Off-peek</p>
</div>
<div class="Company">
<p class="RunningCost">$8.88</p>
<p class="Time">peek</p>
<p class="RunningCost"> $7.77</p>
<p class="Time">Off-peek</p>
</div>
</div>
"""
soup = BeautifulSoup(html_doc , "html.parser")
# You can get the divs with one line of code
powerDetails = soup.find_all(class_="PowerDetails")
print(len(powerDetails)) # Outputs 2
Now, you can slice the list to ignore the first div
powerDetails = powerDetails[1:] # Get elements from 2nd element onwards (ignoring the first one)
print(len(powerDetails)) # Outputs 1
Now, you will have a list with one element only
print(powerDetails)
Output:
[<div class="PowerDetails">
<div class="Company">
<p class="RunningCost">$8.88</p>
<p class="Time">peek</p>
<p class="RunningCost"> $9.99</p>
<p class="Time">Off-peek</p>
</div>
<div class="Company">
<p class="RunningCost">$8.88</p>
<p class="Time">peek</p>
<p class="RunningCost"> $7.77</p>
<p class="Time">Off-peek</p>
</div>
</div>]
Upvotes: 0
Reputation: 14233
find_all()
will return list
. use slicing or index to access just elements you want.
Upvotes: 1