Tim

Reputation: 201

Click display button in Scrapy-Splash

I am scraping http://www.starcitygames.com/buylist/ with scrapy-splash. I have to log in to get the data I need, and that part works fine, but the data I'm after is not accessible until the Display button is clicked. A previous answer told me I cannot simply click the Display button and scrape what shows up, and that I should scrape the JSON endpoint behind it instead. I'm concerned, though, that hitting the JSON endpoint directly will be a red flag to the site's owners, since most visitors never open that page and a human would take several minutes to find it, while a bot gets there almost instantly.

So my question is: is there any way to scrape the page by clicking Display and going from there, or do I have no choice but to scrape the JSON page? This is what I have so far, but it is not clicking the button.

import scrapy
from ..items import NameItem

class LoginSpider(scrapy.Spider):
    name = "LoginSpider"
    start_urls = ["http://www.starcitygames.com/buylist/"]

    def parse(self, response):
        return scrapy.FormRequest.from_response(
            response,
            formcss='#existing_users form',
            formdata={'ex_usr_email': '[email protected]', 'ex_usr_pass': 'password'},
            callback=self.after_login,
        )

    def after_login(self, response):
        item = NameItem()
        display_button = response.xpath('//a[contains(., "Display>>")]/@href').get()

        yield response.follow(display_button, self.parse)

        item["Name"] = response.css("div.bl-result-title::text").get()
        yield item

Snapshot of the website's HTML code

Upvotes: 5

Views: 2239

Answers (2)

GmrYael

Reputation: 382

I have emulated the click with scrapy-splash using a Lua script. It works; you just have to integrate it with Scrapy and process the returned content. Here is the script, which still needs to be wired into Scrapy.

function main(splash)
  -- Log in first; the buylist data is only visible to authenticated users.
  local url = 'https://www.starcitygames.com/login'
  assert(splash:go(url))
  assert(splash:wait(0.5))
  -- Fill in the login form fields, then submit.
  assert(splash:runjs('document.querySelector("#ex_usr_email_input").value = "[email protected]"'))
  assert(splash:runjs('document.querySelector("#ex_usr_pass_input").value = "your_password"'))
  splash:wait(0.5)
  assert(splash:runjs('document.querySelector("#ex_usr_button_div button").click()'))
  splash:wait(3)
  -- Navigate to the buylist page, pick a category, and click Display.
  splash:go('https://www.starcitygames.com/buylist/')
  splash:wait(2)
  assert(splash:runjs('document.querySelectorAll(".bl-specific-name")[1].click()'))
  splash:wait(1)
  assert(splash:runjs('document.querySelector("#bl-search-category").click()'))
  splash:wait(3)
  splash:set_viewport_size(1200,2000)
  -- Return the rendered page plus a screenshot and HAR log for debugging.
  return {
    html = splash:html(),
    png = splash:png(),
    har = splash:har(),
  }
end
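
For the Scrapy side, here is a minimal sketch of how the script might be wired in, assuming scrapy-splash is already enabled in settings.py (SPLASH_URL plus the downloader middlewares from the scrapy-splash README). The spider name and the yielded fields are placeholders, and the Lua script is trimmed to the part that matters for the response:

import scrapy
from scrapy_splash import SplashRequest

# The Lua script above, passed to Splash as a string (trimmed here).
LUA_SCRIPT = """
function main(splash)
  -- ... login, navigation, and clicks from the script above ...
  return {html = splash:html()}
end
"""

class BuylistSplashSpider(scrapy.Spider):
    name = "buylist_splash"  # hypothetical spider name

    def start_requests(self):
        # 'execute' is the Splash endpoint that runs a Lua script.
        yield SplashRequest(
            url="http://www.starcitygames.com/buylist/",
            callback=self.parse_result,
            endpoint="execute",
            args={"lua_source": LUA_SCRIPT},
        )

    def parse_result(self, response):
        # With the default magic_response, the 'html' key returned by the
        # script becomes the response body, so selectors work directly.
        for title in response.css("div.bl-result-title::text").getall():
            yield {"Name": title}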


Upvotes: 3

Kamoo

Reputation: 872

You can use your browser's developer tools to track the request fired by that click. It returns the data in a nice JSON format, and no login cookie is needed:

http://www.starcitygames.com/buylist/search?search-type=category&id=5061

The only thing you need to fill in is the category id for this request, which can be extracted from the HTML and used in your code.

Category name:

//*[@id="bl-category-options"]/option/text()

Category id:

//*[@id="bl-category-options"]/option/@value

Working with JSON is much simpler than parsing HTML.
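
Putting the two together, a minimal sketch of this approach could look like the following. The spider name and yielded fields are placeholders, and it assumes the category <select> is reachable without logging in; if not, keep the FormRequest login step from the question:

import json
import scrapy

class BuylistJsonSpider(scrapy.Spider):
    name = "buylist_json"  # hypothetical spider name
    start_urls = ["http://www.starcitygames.com/buylist/"]

    def parse(self, response):
        # Pair each category name with its id from the <option> elements.
        names = response.xpath('//*[@id="bl-category-options"]/option/text()').getall()
        ids = response.xpath('//*[@id="bl-category-options"]/option/@value').getall()
        for name, category_id in zip(names, ids):
            url = ("http://www.starcitygames.com/buylist/search"
                   "?search-type=category&id=" + category_id)
            yield scrapy.Request(url, callback=self.parse_category,
                                 cb_kwargs={"category": name})

    def parse_category(self, response, category):
        # The endpoint returns JSON, so there is no HTML to parse.
        data = json.loads(response.text)
        yield {"category": category, "results": data}

Note that cb_kwargs requires Scrapy 1.7+; on older versions, pass the category name through request.meta instead.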

Upvotes: 7
