Reputation: 847
I am working on a large-scale web scraping project in which the HTML structure differs from page to page. I want to scrape the product description from each page, and I am using the BeautifulSoup package.
For example, the product description I am trying to scrape appears in HTML structures such as these:
<div class="product-description">
<p> "Title" </p>
<p> "Some content" </p>
<p> "Product description" </p>
</div>
<div class="product-description">
<p> "Title" </p>
<p> "Product description" </p>
</div>
<div class="product-description">
<p> "Title" </p>
<p> "Some content" </p>
<p> "Some content" </p>
<p> "Product description" </p>
</div>
<div class="product-description">
<p> "Title" </p>
<p> "Some-content" </p>
<p> "Some-content" </p>
<p> "Some-content" </p>
<p> "Product description" </p>
</div>
I have written a for loop that gets the data from the div class "product-description" depending on the page structure. My sample code snippet:
requests = (grequests.get(url) for url in urls)
responses = grequests.imap(requests, grequests.Pool(1000))
for response in responses:
    html_soup = BeautifulSoup(response.text, 'html.parser')
    if html_soup.find('div', class_='product-description').next_element.next_sibling.next_sibling.next_sibling.next_sibling:
        product_description = html_soup.find('div', class_='product-description').next_element.next_sibling.next_sibling.next_sibling.next_sibling.text
    elif html_soup.find('div', class_='product-description').next_element.next_sibling.next_sibling.next_sibling:
        product_description = html_soup.find(
            'div', class_='product-description').next_element.next_sibling.next_sibling.next_sibling.text
    elif html_soup.find('div', class_='product-description').next_element.next_sibling.next_sibling:
        product_description = html_soup.find(
            'div', class_='product-description').next_element.next_sibling.next_sibling.text
    else:
        product_description = html_soup.find(
            'div', class_='product-description').next_element.next_sibling.text
I expected the if conditions to check whether there are siblings at the current level of the HTML and, if not, to fall through to the subsequent conditions. However, after 3000 iterations, I am getting an AttributeError saying: 'NoneType' object has no attribute 'next_sibling'.
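A minimal sketch of one way this error can arise, assuming some page has a div with only a single <p> (a hypothetical case, not one of the structures listed above). With html.parser, the newlines between tags are kept as text nodes, so they count as siblings, and a chain of next_sibling calls that runs past the last node returns None before the next attribute access fails. The names chain, kinds and error are illustrative only.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical page: the div holds a single <p>, unlike the structures listed above.
html = """<div class="product-description">
<p> "Product description" </p>
</div>"""

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", class_="product-description")

# Collect the div's first child and all of its following siblings.
chain = []
node = div.next_element
while node is not None:
    chain.append(node)
    node = node.next_sibling

# Whitespace between tags is kept as text nodes, so it shows up in the chain.
kinds = ["tag" if isinstance(n, Tag) else "text" for n in chain]
print(kinds)

# The first condition in the loop walks four siblings past the first child.
# Here the third next_sibling already returns None, so the fourth one raises.
try:
    div.next_element.next_sibling.next_sibling.next_sibling.next_sibling
    error = None
except AttributeError as exc:
    error = str(exc)
print(error)
```

Because each condition is evaluated in full before the next one is tried, the longest chain is always attempted first, so any div with too few nodes fails in the first if rather than falling through.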
I know there must be some other easier way to handle this dynamic page structure. Any help would be much appreciated. Thanks in advance!
Upvotes: 0
Views: 338
Reputation: 5202
Try this:
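for i in soup.find_all('div', class_="product-description"):
    try:
        # The description is always the last <p> inside the div.
        print(i.find_all('p')[-1].text)
    except IndexError:
        # Skip divs that contain no <p> elements.
        pass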
Here, soup is the parsed form of this HTML:
<div class="product-description">
<p> "Title" </p>
<p> "Some content" </p>
<p> "Product description" </p>
</div>
<div class="product-description">
<p> "Title" </p>
<p> "Product description" </p>
</div>
<div class="product-description">
<p> "Title" </p>
<p> "Some content" </p>
<p> "Some content" </p>
<p> "Product description" </p>
</div>
<div class="product-description">
<p> "Title" </p>
<p> "Some-content" </p>
<p> "Some-content" </p>
<p> "Some-content" </p>
<p> "Product description" </p>
</div>
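The same idea can be sketched as a self-contained function, in case it is easier to drop into the request loop. This is only a sketch under assumptions: last_paragraph_texts and sample are hypothetical names, and the empty div in the sample is added just to show that a div without any <p> is skipped instead of raising.

```python
from bs4 import BeautifulSoup


def last_paragraph_texts(html):
    """Return the stripped text of the last <p> in each product-description div."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for div in soup.find_all("div", class_="product-description"):
        paragraphs = div.find_all("p")
        if not paragraphs:
            # No <p> in this div, so there is nothing to extract.
            continue
        results.append(paragraphs[-1].get_text(strip=True))
    return results


# Illustrative input: two of the structures above plus an empty div.
sample = """
<div class="product-description">
<p> "Title" </p>
<p> "Some content" </p>
<p> "Product description" </p>
</div>
<div class="product-description">
<p> "Title" </p>
<p> "Product description" </p>
</div>
<div class="product-description">
</div>
"""

descriptions = last_paragraph_texts(sample)
print(descriptions)
```

Taking the last element of find_all('p') does not depend on how many paragraphs precede the description, and it ignores the whitespace text nodes between tags, so no chain of next_sibling calls is needed.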
Upvotes: 1