Web-scraping dynamic HTML page structure

Question

I am working on a large-scale web scraping project in which every web page has a different HTML structure. I want to scrape the product description from each page, and I am using the BeautifulSoup package.

For example, the product description that I am trying to scrape is stored in HTML structures such as these:


   "Title" 
   "Some content" 
   "Product description" 




   "Title" 
   "Product description" 



   "Title" 
   "Some content" 
   "Some content" 
   "Product description" 




   "Title" 
   "Some-content" 
   "Some-content" 
   "Some-content" 
   "Product description"

I have written a for loop that reads the data from the div with class "product-description", branching on the page structure. My sample code snippet:

import grequests
from bs4 import BeautifulSoup

# urls is the list of page URLs to scrape (defined elsewhere)
requests = (grequests.get(url) for url in urls)
responses = grequests.imap(requests, grequests.Pool(1000))

for response in responses:
    html_soup = BeautifulSoup(response.text, 'html.parser')

    # Try the longest sibling chain first, then progressively shorter ones
    if html_soup.find('div', class_='product-description').next_element.next_sibling.next_sibling.next_sibling.next_sibling:
        product_description = html_soup.find(
            'div', class_='product-description').next_element.next_sibling.next_sibling.next_sibling.next_sibling.text

    elif html_soup.find('div', class_='product-description').next_element.next_sibling.next_sibling.next_sibling:
        product_description = html_soup.find(
            'div', class_='product-description').next_element.next_sibling.next_sibling.next_sibling.text

    elif html_soup.find('div', class_='product-description').next_element.next_sibling.next_sibling:
        product_description = html_soup.find(
            'div', class_='product-description').next_element.next_sibling.next_sibling.text

    else:
        product_description = html_soup.find(
            'div', class_='product-description').next_element.next_sibling.text

I expected each if condition to check whether the sibling exists at the current level of the HTML and, if not, to fall through to the next condition. However, after 3000 iterations, I am getting an AttributeError: 'NoneType' object has no attribute 'next_sibling'.
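The error can be reproduced in isolation, which may help explain why the conditions never fall through. The sketch below assumes the description sits in p tags inside the div and that there is no whitespace between tags; that markup and the probe helper are illustrative and not taken from the real pages. Evaluating the condition itself walks the whole chain, so the first missing sibling raises before the elif branches are reached.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the real tags inside the div are not shown in the question.
html = '<div class="product-description"><p>Title</p><p>Product description</p></div>'
soup = BeautifulSoup(html, "html.parser")

# First node inside the div (here the "Title" paragraph).
first = soup.find("div", class_="product-description").next_element

# The long chain fails as soon as one .next_sibling returns None.
try:
    first.next_sibling.next_sibling.next_sibling
    error = None
except AttributeError as exc:
    error = str(exc)


def probe(node, hops):
    """Follow .next_sibling `hops` times, returning None once the chain ends."""
    for _ in range(hops):
        if node is None:
            return None
        node = node.next_sibling
    return node


print(error)
print(probe(first, 3))       # chain is too short, so None instead of an exception
print(probe(first, 1).text)  # one hop reaches the second paragraph
```

A helper like probe makes the fall-through behave as intended, although selecting the last element directly is usually simpler than counting siblings.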

I know there must be some other easier way to handle this dynamic page structure. Any help would be much appreciated. Thanks in advance!

Joshua Varghese · Accepted Answer

Try this:

for i in soup.find_all('div', class_="product-description"):
    try:
        # The description is the last <p> inside each div
        print(i.find_all('p')[-1].text)
    except IndexError:
        # No <p> inside this div; skip it
        pass
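A self-contained version of the same idea might look like the following. The markup is an assumption (p tags inside div.product-description, with different numbers of paragraphs per block), and last_paragraphs is a hypothetical helper name. It collects the last paragraph of each block and skips blocks that contain no paragraph at all.

```python
from bs4 import BeautifulSoup

# Assumed markup: each block holds a varying number of <p> tags,
# and the product description is always the last one.
html = """
<div class="product-description">
  <p>Title</p><p>Some content</p><p>Product description A</p>
</div>
<div class="product-description">
  <p>Title</p><p>Product description B</p>
</div>
<div class="product-description"></div>
"""


def last_paragraphs(markup):
    """Return the text of the last <p> in every product-description div."""
    soup = BeautifulSoup(markup, "html.parser")
    results = []
    for div in soup.find_all("div", class_="product-description"):
        paragraphs = div.find_all("p")
        if paragraphs:  # skip divs that contain no <p> at all
            results.append(paragraphs[-1].get_text(strip=True))
    return results


descriptions = last_paragraphs(html)
print(descriptions)
```

Because the index -1 always refers to the last match, the number of preceding paragraphs no longer matters.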

Here soup is the BeautifulSoup object built from the four sample structures shown in the question.
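To use this inside the per-response loop from the question, one option is a small function that returns None when a page has no matching div or no paragraph, so a single odd page does not stop a large run. This is a sketch under the same assumed markup; extract_description and the pages dictionary are hypothetical stand-ins for the real responses.

```python
from typing import Optional

from bs4 import BeautifulSoup


def extract_description(page_html: str) -> Optional[str]:
    """Return the last <p> text inside div.product-description, or None."""
    soup = BeautifulSoup(page_html, "html.parser")
    container = soup.find("div", class_="product-description")
    if container is None:
        return None  # page has no product-description block
    paragraphs = container.find_all("p")
    if not paragraphs:
        return None  # block exists but holds no <p>
    return paragraphs[-1].get_text(strip=True)


# Stand-in for response.text values; the real pages are not shown in the question.
pages = {
    "page-a": '<div class="product-description"><p>Title</p><p>Desc A</p></div>',
    "page-b": "<html><body><p>No product block here</p></body></html>",
}

results = {name: extract_description(html) for name, html in pages.items()}
print(results)
```

In the original loop, the call would become product_description = extract_description(response.text), with None values logged or skipped as needed.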
