Andrew

Reputation: 1575

Cannot scrape data from second page with scrapy

I am having trouble scraping more than one page of data. In the Splash console I managed to get 2-3 pages of HTML content. When the outer loop in the Lua script runs only once, to extract a single page, I get 50 URLs. With 2 or more iterations, no data is returned. In the console I get:

Ignoring response <504 https://shopee.sg/search?keyword=hdmi>: HTTP status code is not handled or not allowed

or

504 Gateway Time-out

Here is my code:

import scrapy
from scrapy.selector import Selector
from scrapy_splash import SplashRequest


class Shopee(scrapy.Spider):
  name = 'shopee'

  script = '''
    function main(splash, args)
      assert(splash:go(args.url))
      assert(splash:wait(5.0))
      treat=require('treat')
      result = {}
      pages = splash:select('.shopee-mini-page-controller__total')

      for i=1,3,1 do
        for j=1,2,1 do
          assert(splash:runjs("window.scrollBy(0, 1300)"))
          assert(splash:wait(5.0))
        end

        result[i]=splash:html()
        assert(splash:runjs('document.querySelector(".shopee-icon-button--right").click()'))
        assert(splash:wait(8.0))
      end
      return treat.as_array(result)
    end
  '''

  def start_requests(self):
    urls = [
        'https://shopee.sg/search?keyword=hdmi'
    ]
    for link in urls:
      yield SplashRequest(url=link, callback=self.parse, endpoint='execute', args={'wait': 2.5, 'lua_source' : self.script}, dont_filter=True)


  def parse(self, response):
    for page in response.data:
      sel = Selector(text=page)
      yield {
        'urls': sel.xpath("//div[contains(@class, 'shopee-search-item-result__item')]//a[*]/@href").getall()
      }

Upvotes: 0

Views: 409

Answers (1)

amarynets

Reputation: 1815

I think you get the timeout error because of your Lua script. The clock for receiving a response starts as soon as the spider sends the request. In each iteration of the outer loop, your Lua script does the following:

- Runs JS twice to scroll, which takes some time.
- Calls splash:wait(5.0) twice so that data can download and render.
- Then calls assert(splash:wait(8.0)).

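Minimum total time: 5 (initial wait) + 3 * (2 * 5 + 8) = 59 seconds, plus the time to run splash:runjs and other overhead. That is well over Splash's default 30-second timeout, whereas a single iteration (5 + 18 = 23 seconds) fits within it, which matches what you observe.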
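The wait budget can be checked with a quick calculation. This is only a sketch: it sums the explicit splash:wait calls from the question's script, including the ones inside both loops, and ignores the time spent in splash:runjs and on the network. The constant and function names are illustrative. The last lines show, as an assumption about how you might proceed if you keep Splash, a larger `timeout` value that could be merged into the SplashRequest args; Splash rejects values above its --max-timeout setting.

```python
# Explicit waits taken from the Lua script in the question (seconds).
INITIAL_WAIT = 5.0      # splash:wait(5.0) right after splash:go
SCROLL_WAIT = 5.0       # splash:wait(5.0) after each scroll
SCROLLS_PER_PAGE = 2    # inner loop: for j=1,2,1
CLICK_WAIT = 8.0        # splash:wait(8.0) after clicking "next"


def min_script_seconds(pages):
    """Lower bound on the script's run time: waits only, no JS or network time."""
    return INITIAL_WAIT + pages * (SCROLLS_PER_PAGE * SCROLL_WAIT + CLICK_WAIT)


for pages in (1, 2, 3):
    print(pages, min_script_seconds(pages))

# Assumption: extra entry for the SplashRequest `args` dict if Splash is kept.
# 90 is an illustrative value, not one taken from the answer.
splash_args = {"timeout": 90}
```

With these numbers only the one-page run stays under 30 seconds, so shortening the waits or raising the timeout would both be worth trying before giving up on Splash.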

But in your case Splash isn't required. You can request the next page directly from your spider. In Chrome, open Dev Tools -> Network -> XHR; there you will find the request URL https://shopee.sg/api/v2/search_items/?by=relevancy&keyword=hdmi&limit=50&newest=250&order=desc&page_type=search
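A minimal sketch of how the page URLs could be generated from that request. The parameter names come from the URL above; treating `newest` as the offset of the first item on the page (so page index times `limit`) is an assumption inferred from `limit=50&newest=250`, and `build_search_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

API_BASE = "https://shopee.sg/api/v2/search_items/"


def build_search_url(keyword, page, limit=50):
    """Return the search API URL for a zero-based page index.

    Assumption: `newest` is the offset of the first item on the page.
    """
    params = {
        "by": "relevancy",
        "keyword": keyword,
        "limit": limit,
        "newest": page * limit,
        "order": "desc",
        "page_type": "search",
    }
    return API_BASE + "?" + urlencode(params)


# Print the URLs for the first three result pages.
for page in range(3):
    print(build_search_url("hdmi", page))
```

In the spider, each of these URLs could be yielded as a plain scrapy.Request and the body parsed as JSON. The layout of that JSON is not shown here, so inspect a real response in Dev Tools before relying on any field names.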

You can then use it to get all the info you need. In your case that is the URL of each product, but the response contains no direct URL; you must slugify the name. For example, [Spot is sold very well]Micro USB para HDMI Adaptador MHL para HDMI 1080 P becomes -Spot-is-sold-very-well-Micro-USB-para-HDMI-Adaptador-MHL-para-HDMI-1080-P-HD-TV-. Then add the two ids, shopid and itemid. As you can see, the two names differ slightly, but it works.
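The slugify step might look like the sketch below. The rule (each run of non-alphanumeric characters becomes one hyphen) is inferred from the single example above, so it may not match Shopee's own behaviour for every name, and non-ASCII letters are replaced as well. The `-i.<shopid>.<itemid>` suffix in `product_url` is an assumed URL pattern that is not stated in this answer; check it against a real product link. Both function names and the ids used in the test are hypothetical.

```python
import re


def slugify(name):
    """Replace each run of non-alphanumeric characters with a single hyphen."""
    return re.sub(r"[^0-9A-Za-z]+", "-", name)


def product_url(name, shopid, itemid):
    # Assumed pattern for combining the slug with the two ids; verify it
    # against a real product link before using it.
    return f"https://shopee.sg/{slugify(name)}-i.{shopid}.{itemid}"


print(slugify("[Spot is sold very well]Micro USB para HDMI"))
# prints "-Spot-is-sold-very-well-Micro-USB-para-HDMI"
```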

Upvotes: 1
