stacker

Reputation: 25

How to use Selenium with Scrapy for crawling ajax pages

I'm new to Scrapy, and I'm having trouble crawling the page I need to scrape.

Without filling in any of the fields on the page, I need to click the "PESQUISAR" (translation: search) button directly and scrape all of the result pages shown below.

It looks like my problem is with the page's JavaScript, and I've never worked with JavaScript.

from scrapy import Spider
from selenium import webdriver
from scrapy.selector import Selector

class CarfSpider(Spider):
    name = 'carf'
    allowed_domains = ['example.com']

    def start_requests(self):
        self.driver = webdriver.Chrome('/Users/Desktop/chromedriver')
        self.driver.get('example.com')
        sel = Selector(text=self.driver.page_source)
        carf = sel.xpath('//*[@id="botaoPesquisarCarf"]')

My main difficulty is crawling this page, so if anyone can help me with this, I'd appreciate it.

Sorry for the bad English; I hope it's understandable.

Upvotes: 0

Views: 1965

Answers (1)

Laerte

Reputation: 260

You have to use the driver to click the Pesquisar button, then use WebDriverWait to wait until the table element with id tblJurisprudencia is present (indicating that the page has fully loaded) before grabbing the page source and parsing the Acórdão values from it.

# -*- coding: utf-8 -*-
from scrapy import Spider
from scrapy import Selector
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


class CarfSpider(Spider):

    name = 'carf'
    start_urls = ['https://carf.fazenda.gov.br/sincon/public/pages/ConsultarJurisprudencia/consultarJurisprudenciaCarf.jsf']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.driver = webdriver.Chrome(executable_path='/home/laerte/chromedriver')

    def parse(self, response):
        # Load the page in the browser and trigger the search without filling any fields
        self.driver.get(response.url)
        self.driver.find_element_by_id('botaoPesquisarCarf').click()

        # Wait up to 10 seconds for the results table to be rendered by the page's JavaScript
        page_loaded = WebDriverWait(self.driver, 10).until(
            EC.presence_of_element_located((By.ID, "tblJurisprudencia"))
        )

        if page_loaded:
            # Hand the rendered HTML to a Scrapy selector
            response_selenium = Selector(text=self.driver.page_source)

            table = response_selenium.xpath("//table[@id='tblJurisprudencia']")

            # Use relative XPaths (".//") so each expression is scoped to the
            # current node instead of matching against the whole document
            for row in table.xpath(".//tr"):
                body = row.xpath(".//div[@class='rich-panel-body ']")

                yield {
                    "acordao": body.xpath("./a/text()").extract_first()
                }

    def closed(self, reason):
        # Release the browser when the spider finishes
        self.driver.quit()

Upvotes: 1
