Abishek
Reputation: 847

Web-scraping dynamic HTML page structure

I am working on a large-scale web-scraping project in which the HTML structure differs from page to page. I want to scrape the product description from each page, and I am using the BeautifulSoup package.

For example, the product description I am trying to scrape appears in HTML structures such as these:

<div class="product-description">
  <p> "Title" </p>
  <p> "Some content" </p>
  <p> "Product description" </p>
</div>


<div class="product-description">
  <p> "Title" </p>
  <p> "Product description" </p>
</div>

<div class="product-description">
  <p> "Title" </p>
  <p> "Some content" </p>
  <p> "Some content" </p>
  <p> "Product description" </p>
</div>


<div class="product-description">
  <p> "Title" </p>
  <p> "Some-content" </p>
  <p> "Some-content" </p>
  <p> "Some-content" </p>
  <p> "Product description" </p>
</div>

I have written a for loop that reads the data from the div with class "product-description", branching on the page structure. My sample code snippet:

import grequests
from bs4 import BeautifulSoup

requests = (grequests.get(url) for url in urls)
responses = grequests.imap(requests, grequests.Pool(1000))

for response in responses:

    html_soup = BeautifulSoup(response.text, 'html.parser')

    # Each branch probes one sibling fewer than the branch before it.
    if html_soup.find('div', class_='product-description').next_element.next_sibling.next_sibling.next_sibling.next_sibling:
        product_description = html_soup.find(
            'div', class_='product-description').next_element.next_sibling.next_sibling.next_sibling.next_sibling.text

    elif html_soup.find('div', class_='product-description').next_element.next_sibling.next_sibling.next_sibling:
        product_description = html_soup.find(
            'div', class_='product-description').next_element.next_sibling.next_sibling.next_sibling.text

    elif html_soup.find('div', class_='product-description').next_element.next_sibling.next_sibling:
        product_description = html_soup.find(
            'div', class_='product-description').next_element.next_sibling.next_sibling.text

    else:
        product_description = html_soup.find(
            'div', class_='product-description').next_element.next_sibling.text

I expected the if conditions to check whether siblings exist at the current level of the HTML and, if not, to fall through to the subsequent conditions. However, after 3000 iterations, I get an AttributeError: 'NoneType' object has no attribute 'next_sibling'.

I know there must be an easier way to handle this dynamic page structure. Any help would be much appreciated. Thanks in advance!

Upvotes: 0

Views: 338

Answers (1)

Joshua Varghese

Reputation: 5202

Try this:

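for div in soup.find_all('div', class_="product-description"):
    paragraphs = div.find_all('p')
    # The description is the last <p> in every sample; skip divs with no <p>.
    if paragraphs:
        print(paragraphs[-1].text)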

Here soup is the BeautifulSoup object parsed from the following HTML:

<div class="product-description">
  <p> "Title" </p>
  <p> "Some content" </p>
  <p> "Product description" </p>
</div>


<div class="product-description">
  <p> "Title" </p>
  <p> "Product description" </p>
</div>

<div class="product-description">
  <p> "Title" </p>
  <p> "Some content" </p>
  <p> "Some content" </p>
  <p> "Product description" </p>
</div>


<div class="product-description">
  <p> "Title" </p>
  <p> "Some-content" </p>
  <p> "Some-content" </p>
  <p> "Some-content" </p>
  <p> "Product description" </p>
</div>
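
To plug the same idea into the per-response loop from the question, a minimal sketch could look like the one below. It assumes the description is always the last <p> inside the div, as in the samples above. The helper name extract_description and the test strings are made up for illustration. Minified markup is one likely reason a fixed chain of .next_sibling calls fails: with no whitespace between the tags, the chain runs past the last <p>, returns None, and the next .next_sibling raises the AttributeError. Selecting the last <p> directly avoids counting siblings at all.

```python
from bs4 import BeautifulSoup


def extract_description(html):
    """Return the text of the last <p> in the product-description div, or None.

    Hypothetical helper, not part of BeautifulSoup.
    """
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", class_="product-description")
    if container is None:
        # The page has no such div (error page, different template, ...).
        return None
    paragraphs = container.find_all("p")
    if not paragraphs:
        return None
    return paragraphs[-1].get_text(strip=True)


# Illustrative inputs: minified markup, indented markup, and a page without the div.
minified = '<div class="product-description"><p>Title</p><p>Product description</p></div>'
spaced = """
<div class="product-description">
  <p> Title </p>
  <p> Some content </p>
  <p> Product description </p>
</div>
"""

for page in (minified, spaced, "<html><body>Not found</body></html>"):
    print(extract_description(page))
```

Inside the original loop this would be called as product_description = extract_description(response.text), with a check for None before using the result.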

Upvotes: 1
