Reputation: 133
I want to scrape a few URLs that each have two divs using the same class="description".
The source code of a sample URL looks like this:
<!-- Initial HTML here -->
<div class="description">
<h4> Anonymous Title </h4>
<div class="product-description">
<li> Some stuff here </li>
</div>
</div>
<!-- Middle HTML here -->
<div class="description">
Some text here
</div>
<!-- Last HTML here -->
I'm scraping it with BeautifulSoup using the following script:
# imports etc here
description_box = soup.find('div', attrs={'class': 'description'})
description = description_box.text.strip()
print(description)
Running it gives me only the first div with class="description"; however, I want only the second div with class="description".
Any ideas how I can ignore the first div and just scrape the second?
P.S. The first div always has h4 tags, and the second div only has plain text between its tags.
Upvotes: 0
Views: 2196
Reputation: 84465
You can combine a type selector with a class selector in CSS and index into the returned collection:
print(soup.select('div.description')[1].text)
Upvotes: 0
Reputation: 5958
Use a CSS selector, as it supports the nth-of-type pseudo-class to select the nth element of your specification. Also, the syntax is cleaner.
description_box = soup.select("div.description:nth-of-type(2)")[0]
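One thing worth checking before relying on this: nth-of-type counts sibling elements of the same tag, not matches of the class. The sketch below uses a made-up page (the "banner" div is an assumption, not part of the question's HTML) to show how an unrelated div between the two description divs changes the result, and it assumes a Beautiful Soup version whose select() supports this pseudo-class.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: an unrelated div sits between the two description divs.
html = """
<body>
<div class="description"><h4>Anonymous Title</h4></div>
<div class="banner">Unrelated div</div>
<div class="description">Some text here</div>
</body>
"""

soup = BeautifulSoup(html, "html.parser")

# nth-of-type(2) asks for a div that is the second <div> among its siblings.
# Here the second <div> is the banner, which lacks the class, so nothing matches.
by_position = soup.select("div.description:nth-of-type(2)")

# Indexing the full match list counts only the divs that carry the class.
by_index = soup.select("div.description")[1]

print(by_position)
print(by_index.get_text(strip=True))
```

If the real pages can have other divs at the same level, indexing into the list of matches is probably the safer of the two.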
Upvotes: 0
Reputation: 28585
If you use .find_all, it returns all matching elements in a list. It's then just a matter of selecting the second item in that list using index 1:
from bs4 import BeautifulSoup

html = '''<!-- Initial HTML here -->
<div class="description">
<h4> Anonymous Title </h4>
<div class="product-description">
<li> Some stuff here </li>
</div>
</div>
<!-- Middle HTML here -->
<div class="description">
Some text here
</div>
<!-- Last HTML here -->'''
soup = BeautifulSoup(html, 'html.parser')
divs = soup.find_all('div', {'class':'description'})
div = divs[1]
print(div)
Output:
<div class="description">
Some text here
</div>
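Since the question says the first div always contains an h4 and the second holds only plain text, another option is to filter on that instead of on position. This is only a sketch under that assumption; the names plain_divs and text are made up for illustration.

```python
from bs4 import BeautifulSoup

html = '''<!-- Initial HTML here -->
<div class="description">
<h4> Anonymous Title </h4>
<div class="product-description">
<li> Some stuff here </li>
</div>
</div>
<!-- Middle HTML here -->
<div class="description">
Some text here
</div>
<!-- Last HTML here -->'''

soup = BeautifulSoup(html, 'html.parser')

# Keep only the description divs that have no <h4> anywhere inside them.
plain_divs = [
    d for d in soup.find_all('div', class_='description')
    if d.find('h4') is None
]

# Guard against pages where no such div exists.
text = plain_divs[0].get_text(strip=True) if plain_divs else None
print(text)
```

This keeps working if the order of the two divs changes, but it breaks if a page ever puts an h4 inside the div you want.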
Upvotes: 2