Reputation: 1575
I am having trouble scraping more than one page of data. In the Splash console, I managed to get 2-3 pages of HTML content. When the first loop in the Lua script iterates only once, to extract one page, I get 50 URLs. With 2 or more iterations, no data is returned. In the console I get:
Ignoring response <504 https://shopee.sg/search?keyword=hdmi>: HTTP status code is not handled or not allowed
or
504 Gateway Time-out
Here is my code
class Shopee(scrapy.Spider):
    name = 'shopee'

    script = '''
    function main(splash, args)
        assert(splash:go(args.url))
        assert(splash:wait(5.0))
        treat=require('treat')
        result = {}
        pages = splash:select('.shopee-mini-page-controller__total')
        for i=1,3,1 do
            for j=1,2,1 do
                assert(splash:runjs("window.scrollBy(0, 1300)"))
                assert(splash:wait(5.0))
            end
            result[i]=splash:html()
            assert(splash:runjs('document.querySelector(".shopee-icon-button--right").click()'))
            assert(splash:wait(8.0))
        end
        return treat.as_array(result)
    end
    '''

    def start_requests(self):
        urls = [
            'https://shopee.sg/search?keyword=hdmi'
        ]
        for link in urls:
            yield SplashRequest(
                url=link,
                callback=self.parse,
                endpoint='execute',
                args={'wait': 2.5, 'lua_source': self.script},
                dont_filter=True,
            )

    def parse(self, response):
        for page in response.data:
            sel = Selector(text=page)
            yield {
                'urls': sel.xpath("//div[contains(@class, 'shopee-search-item-result__item')]//a[*]/@href").getall()
            }
Upvotes: 0
Views: 409
Reputation: 1815
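I think you get the timeout error because of your Lua script. The clock for the response starts as soon as the spider sends the request, and the script spends most of its time waiting:
An initial splash:wait(5.0) after the page loads
Per page, two scrolls via splash:runjs, each followed by splash:wait(5.0) so the new data can download and render
Per page, a splash:wait(8.0) after clicking the next-page button
Minimum total time for three pages:
5 + 3 * (2 * 5 + 8) = 59 seconds, plus the time to run splash:runjs and everything else
That is well over Splash's default timeout of 30 seconds, which is why one iteration (about 23 seconds) works and two or more return a 504. You can raise the limit with the timeout argument, up to the server's --max-timeout.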
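The waits in the script can be added up for any number of pages to see where the timeout is crossed. This is only a rough sketch: the function name and parameters are made up for illustration, the 30-second figure assumes Splash's default `timeout` has not been changed, and the time spent in `splash:go` and `splash:runjs` is ignored.

```python
def min_script_seconds(pages, initial_wait=5.0, scrolls=2,
                       scroll_wait=5.0, click_wait=8.0):
    """Lower bound on the Lua script's run time, counting only splash:wait calls."""
    # Each page: `scrolls` waits after scrolling, then one wait after the click.
    per_page = scrolls * scroll_wait + click_wait
    return initial_wait + pages * per_page


# Assumption: Splash's default render timeout (30 s) is in effect.
SPLASH_DEFAULT_TIMEOUT = 30.0

for pages in (1, 2, 3):
    total = min_script_seconds(pages)
    fits = total <= SPLASH_DEFAULT_TIMEOUT
    print(f"{pages} page(s): at least {total:.0f}s, fits in timeout: {fits}")
```

With these numbers only the single-page run stays under the limit, which matches what the question describes.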
But in your case Splash isn't required. You can request the next page directly from your spider. In Chrome->Dev Tools->Network->XHR you will find the request URL https://shopee.sg/api/v2/search_items/?by=relevancy&keyword=hdmi&limit=50&newest=250&order=desc&page_type=search
You can then use it to get all the info you need. In your case that is the URL of each product, but the response has no direct URL, so you must slugify the name. For example, [Spot is sold very well]Micro USB para HDMI Adaptador MHL para HDMI 1080 P
becomes -Spot-is-sold-very-well-Micro-USB-para-HDMI-Adaptador-MHL-para-HDMI-1080-P-HD-TV-
and then you add two ids: shopid and itemid.
As you can see, the two names differ slightly, but it works.
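A rough sketch of both steps, using only the standard library so it can be tried outside Scrapy. Several things here are assumptions rather than facts from the API: that `newest` is an item offset equal to page index times `limit`, that replacing every run of non-alphanumeric characters with a hyphen is a good enough slug, and that the ids are appended as `-i.<shopid>.<itemid>`. The helper names and the ids in the usage lines are hypothetical.

```python
import re
from urllib.parse import urlencode

API_BASE = "https://shopee.sg/api/v2/search_items/"


def search_api_url(keyword, page, limit=50):
    """Build the XHR search URL for a zero-based page index.

    Assumption: `newest` is the offset of the first item on the page.
    """
    params = {
        "by": "relevancy",
        "keyword": keyword,
        "limit": limit,
        "newest": page * limit,
        "order": "desc",
        "page_type": "search",
    }
    return API_BASE + "?" + urlencode(params)


def slugify(name):
    """Replace each run of non-alphanumeric characters with one hyphen."""
    return re.sub(r"[^A-Za-z0-9]+", "-", name)


def product_url(name, shopid, itemid):
    """Assumed product URL layout: slug followed by `-i.<shopid>.<itemid>`."""
    return f"https://shopee.sg/{slugify(name)}-i.{shopid}.{itemid}"


# Usage with made-up ids (real ones would come from the JSON response).
print(search_api_url("hdmi", page=5))
print(product_url("[Spot is sold very well]Micro USB para HDMI", 111, 222))
```

In a spider, each URL from `search_api_url` could be yielded as a plain `scrapy.Request` and the JSON body parsed in the callback, so no rendering or waiting is involved.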
Upvotes: 1