Reputation: 9
I tried scraping the page with BeautifulSoup, but it returns []. When I viewed the source code, I found a div with class="loading32" where the content should be.

How do you scrape this kind of element?
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

my_url = productUrl                         # URL of the product page
uClient = uReq(my_url)                      # open the connection
page_html = uClient.read()                  # read the raw HTML
uClient.close()
page_soup = soup(page_html, "html.parser")  # parse with BeautifulSoup
description = page_soup.findAll("div", {"class": "ui-box product-description-main"})
string4 = str(description)
<div class="ui-box product-description-main" id="j-product-description">
<div class="ui-box-title">Product Description</div>
<div class="ui-box-body">
<div class="description-content" data-role="description" data-spm="1000023">
<div class="loading32"></div>
</div>
</div>
</div>
Upvotes: 1
Views: 1067
Reputation: 46759
The information is all there; JavaScript is not needed to retrieve it. You just need to look through the HTML that is returned and decide the best way to extract each item you want. I have guessed you might be trying to get something like the following:
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup

my_url = 'https://www.aliexpress.com/store/product/100-Original-16-Shimano-Casitas-150-151-150hg-151hg-Right-Left-Hand-Baitcasting-Fishing-Reel-4/1053031_32657797704.html?spm=2114.12010608.0.0.22e12d66I7a3Dp'
uClient = uReq(my_url)      # open the connection
page_html = uClient.read()  # read the raw HTML
uClient.close()
soup = BeautifulSoup(page_html, "html.parser")

details = {}
details['Product Name'] = soup.find('h1', class_='product-name').text
details['Price Range'] = soup.find('div', class_='p-current-price').find_all('span')[1].text

# Each item specific is an <li> in the form "Name:Value"
item_specifics = soup.find('ul', class_='product-property-list util-clearfix')
for li in item_specifics.find_all('li'):
    entry = li.get_text(strip=True).split(':')
    details[entry[0]] = ', '.join(entry[1:])

# Locate the image
li = soup.find('div', class_='ui-image-viewer-thumb-wrap')
url = li.img['src']
details['Image URL'] = url
details['Image Filename'] = url.rsplit('/', 1)[1]

for item, desc in details.items():
    print('{:30} {}'.format(item, desc))
Would give you the following information:
Product Name Original 2016 Shimano Casitas 150 151 150hg 151hg Right Left Hand Baitcasting Fishing Reel 4+1BB 5.5kg SVS Infinity fishing reel
Price Range 83.60 - 85.60
Fishing Method Bait Casting
Baits Type Fake Bait
Position Ocean Rock Fshing,River,Stream,Reservoir Pond,Ocean Beach Fishing,Lake,Ocean Boat Fishing
Fishing Reels Type Baitcast Reel
Model Number Casitas
Brand Name Shimano
Ball Bearings 4+1BB
Feature 1 Shimano Stable Spool S3D
Feature 2 SVS Infinity Brake System (Infinite Cast Control)
Model 150/ 151/ 150HG/ 151HG
PE Line (50 test /m) 20-150/30-135/ 40-105
Nylon Line (51hg test /m) 10-120/12-110/14-90
Weight 190g
Gear Ratio 6.3, 1 / 7.2, 1
Made in Malaysia
Image URL https://ae01.alicdn.com/kf/HTB1qRKzJFXXXXboXVXXq6xXFXXXU/Original-2016-Shimano-Casitas-150-151-150hg-151hg-Right-Left-Hand-Baitcasting-Fishing-Reel-4-1BB.jpg_640x640.jpg
Image Filename Original-2016-Shimano-Casitas-150-151-150hg-151hg-Right-Left-Hand-Baitcasting-Fishing-Reel-4-1BB.jpg_640x640.jpg
The image information is also stored. The image could then be downloaded using another uReq
call, saving the response data as binary into a file using the filename obtained.
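That download step could be sketched as follows (a minimal example, not part of the script above; the URL comes from details['Image URL'], and mode 'wb' writes the JPEG bytes unmodified):

```python
from urllib.request import urlopen as uReq

def filename_from_url(url):
    # Same split the scraper used to build details['Image Filename']
    return url.rsplit('/', 1)[1]

def download_image(url):
    """Fetch the image and save its raw bytes under the derived filename."""
    filename = filename_from_url(url)
    uClient = uReq(url)
    data = uClient.read()               # raw binary image data
    uClient.close()
    with open(filename, 'wb') as f:     # binary mode keeps the bytes intact
        f.write(data)
    return filename
```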
Upvotes: 0
Reputation: 353
So the problem here is that these loading32 elements are being filled in by JavaScript running on the client end. This is an ideal use case for Splash. ScrapingHub provides this renderer, which can be used via a curl/HTTP API, and you can also execute some Lua code that helps you work around problems like JS-triggered page loads, waits, clicks and whatnot.

Link: Splash Documentation

You can also integrate Splash with Scrapy, which is great.

Link: Scrapy Splash Github
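As an illustration, a request to Splash's HTTP API could be built like this (a sketch assuming a Splash instance running locally on the default port 8050; the render.html endpoint returns the page HTML after JavaScript has run, and the wait parameter gives the scripts time to replace those loading32 divs):

```python
from urllib.parse import urlencode

SPLASH = 'http://localhost:8050/render.html'  # assumes a local Splash instance

def splash_render_url(page_url, wait=2.0):
    """Build a Splash render.html URL for a JavaScript-heavy page."""
    params = urlencode({'url': page_url, 'wait': wait})
    return '{}?{}'.format(SPLASH, params)

# Fetch this URL (e.g. with urlopen) instead of the page itself:
print(splash_render_url('https://www.aliexpress.com/store/product/...'))
```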
Upvotes: 1