Jack AQ

Reputation: 9

BeautifulSoup-Python : How do you scrape data that has not been loaded yet?

I tried scraping the page with BeautifulSoup, but it returns []. When I view the page source, the description container holds only a div with class="loading32".

How do you scrape this kind of element?

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

my_url = productUrl  # productUrl holds the URL given below
uClient = uReq(my_url)  # open the connection
page_html = uClient.read()  # read the raw HTML
uClient.close()
page_soup = soup(page_html, "html.parser")  # bs4 part
description = page_soup.findAll("div", {"class": "ui-box product-description-main"})
string4 = str(description)

URL : https://www.aliexpress.com/store/product/100-Original-16-Shimano-Casitas-150-151-150hg-151hg-Right-Left-Hand-Baitcasting-Fishing-Reel-4/1053031_32657797704.html?spm=2114.12010608.0.0.22e12d66I7a3Dp

<div class="ui-box product-description-main" id="j-product-description">
        <div class="ui-box-title">Product Description</div>
        <div class="ui-box-body">

            <div class="description-content" data-role="description" data-spm="1000023">
            <div class="loading32"></div>
            </div>

        </div>
    </div>

Upvotes: 1

Views: 1067

Answers (2)

Martin Evans

Reputation: 46759

The information is all present in the HTML that is returned; JavaScript is not needed to get it. You just need to look through that HTML and decide on the best way to extract each item you want. I have guessed that you are after something like the following:

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup

my_url = 'https://www.aliexpress.com/store/product/100-Original-16-Shimano-Casitas-150-151-150hg-151hg-Right-Left-Hand-Baitcasting-Fishing-Reel-4/1053031_32657797704.html?spm=2114.12010608.0.0.22e12d66I7a3Dp'
uClient = uReq(my_url)  # open the connection
page_html = uClient.read()  # read the raw HTML
uClient.close()

soup = BeautifulSoup(page_html, "html.parser")  # bs4 part

details = {}
details['Product Name'] = soup.find('h1', class_='product-name').text
details['Price Range'] = soup.find('div', class_='p-current-price').find_all('span')[1].text

item_specifics = soup.find('ul', class_='product-property-list util-clearfix')
for li in item_specifics.find_all('li'):
    entry = li.get_text(strip=True).split(':')
    details[entry[0]] = ', '.join(entry[1:])

# Locate the image    
li = soup.find('div', class_='ui-image-viewer-thumb-wrap')
url = li.img['src']
details['Image URL'] = url
details['Image Filename'] = url.rsplit('/', 1)[1]

for item, desc in details.items():
    print('{:30} {}'.format(item, desc))

This would give you the following information:

Product Name                   Original 2016 Shimano Casitas 150 151 150hg 151hg Right Left Hand Baitcasting Fishing Reel 4+1BB 5.5kg SVS Infinity fishing reel
Price Range                    83.60 - 85.60
Fishing Method                 Bait Casting
Baits Type                     Fake Bait
Position                       Ocean Rock Fshing,River,Stream,Reservoir Pond,Ocean Beach Fishing,Lake,Ocean Boat Fishing
Fishing Reels Type             Baitcast Reel
Model Number                   Casitas
Brand Name                     Shimano
Ball Bearings                  4+1BB
Feature 1                      Shimano Stable Spool S3D
Feature 2                      SVS Infinity Brake System (Infinite Cast Control)
Model                          150/ 151/ 150HG/ 151HG
PE Line (50 test /m)           20-150/30-135/ 40-105
Nylon Line (51hg test /m)      10-120/12-110/14-90
Weight                         190g
Gear Ratio                     6.3, 1 / 7.2, 1
Made in                        Malaysia
Image URL                      https://ae01.alicdn.com/kf/HTB1qRKzJFXXXXboXVXXq6xXFXXXU/Original-2016-Shimano-Casitas-150-151-150hg-151hg-Right-Left-Hand-Baitcasting-Fishing-Reel-4-1BB.jpg_640x640.jpg
Image Filename                 Original-2016-Shimano-Casitas-150-151-150hg-151hg-Right-Left-Hand-Baitcasting-Fishing-Reel-4-1BB.jpg_640x640.jpg
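
One detail in the listing above: the Gear Ratio row reads "6.3, 1 / 7.2, 1", which suggests the value on the page contained colons (presumably 6.3:1 / 7.2:1) that split(':') broke apart and the join glued back together with commas. A minimal sketch of an alternative, splitting on the first colon only; split_property is a hypothetical helper and the input string is an assumed example, not text taken from the page:

```python
def split_property(text):
    """Split 'Name:Value' on the first colon only, keeping later colons in the value."""
    name, _, value = text.partition(':')
    return name.strip(), value.strip()


# Hypothetical input resembling the text of one <li> item
print(split_property('Gear Ratio:6.3:1 / 7.2:1'))
# prints ('Gear Ratio', '6.3:1 / 7.2:1')
```

Inside the loop this would replace the split and join lines, for example name, value = split_property(li.get_text(strip=True)).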

The image URL and filename are also stored. The image could then be downloaded with another uReq call, writing the response bytes to a file opened in binary mode under the filename obtained.
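
The download step could look roughly like the sketch below. It uses urlopen directly (the same function the code above imports as uReq); filename_from_url and download_image are hypothetical helper names, and nothing is fetched until download_image is called:

```python
from urllib.request import urlopen


def filename_from_url(url):
    """Return the last path segment of the URL, as the rsplit call above does."""
    return url.rsplit('/', 1)[1]


def download_image(url, filename=None):
    """Fetch the image bytes and write them to disk in binary mode."""
    filename = filename or filename_from_url(url)
    with urlopen(url) as response, open(filename, 'wb') as f_image:
        f_image.write(response.read())
    return filename
```

With the dictionary built earlier, the call would be download_image(details['Image URL'], details['Image Filename']).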

Upvotes: 0

yadavankit

Reputation: 353

So the problem here is that the content behind these loading32 elements is loaded by JavaScript on the client side. This is an ideal use case for Splash, a rendering service from ScrapingHub. It can be driven over an HTTP API (for example with curl), and it can also run Lua scripts, which helps with problems such as JS-triggered page loads, waits, clicks and so on.

Link : Splash Documentation
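
As a rough sketch of the HTTP API route, the snippet below asks Splash's render.html endpoint for the page after JavaScript has run. It assumes a Splash instance is already running locally on its default port 8050; the helper names and the 2-second wait are illustrative choices, not values taken from the question:

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: Splash is running locally on its default port
SPLASH_ENDPOINT = 'http://localhost:8050/render.html'


def splash_render_url(page_url, wait=2.0, endpoint=SPLASH_ENDPOINT):
    """Build the render.html request URL; 'wait' gives scripts time to run."""
    query = urlencode({'url': page_url, 'wait': wait})
    return '{}?{}'.format(endpoint, query)


def fetch_rendered_html(page_url, wait=2.0):
    """Return the page HTML as rendered by Splash (requires a running instance)."""
    with urlopen(splash_render_url(page_url, wait)) as response:
        return response.read()
```

The bytes returned by fetch_rendered_html(productUrl) can be passed to BeautifulSoup in place of page_html. Whether 2 seconds is long enough for the description to load on this particular page is not something the sketch can guarantee.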

Also, you can integrate Splash with Scrapy. Amazing, right?

Link : Scrapy Splash Github

Upvotes: 1
