Reputation: 25
I'm new to Scrapy and I'm having trouble crawling the page I need to scrape.
Without filling in any of the fields on the page, I need to click the "PESQUISAR" ("search") button directly and then scrape all of the result pages shown below.
It looks like my problem is with the page's JavaScript, and I've never worked with JavaScript.
from scrapy import Spider
from selenium import webdriver
from scrapy.selector import Selector


class CarfSpider(Spider):
    name = 'carf'
    allowed_domains = ['example.com']

    def start_requests(self):
        self.driver = webdriver.Chrome('/Users/Desktop/chromedriver')
        self.driver.get('example.com')
        sel = Selector(text=self.driver.page_source)
        carf = sel.xpath('//*[@id="botaoPesquisarCarf"]')
My main difficulty is crawling this page, so if anyone can help me with this, I'd appreciate it.
Sorry for the bad English; I hope you've understood.
Upvotes: 0
Views: 1965
Reputation: 260
You have to use the driver to click the Pesquisar button, then call WebDriverWait to wait until the table element with id tblJurisprudencia is present (indicating the page is fully loaded), get the page source, and parse the Acórdão values from it.
# -*- coding: utf-8 -*-
from scrapy import Spider
from scrapy.selector import Selector
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


class CarfSpider(Spider):
    name = 'carf'
    start_urls = ['https://carf.fazenda.gov.br/sincon/public/pages/ConsultarJurisprudencia/consultarJurisprudenciaCarf.jsf']

    def __init__(self):
        self.driver = webdriver.Chrome(executable_path='/home/laerte/chromedriver')

    def parse(self, response):
        self.driver.get(response.url)
        self.driver.find_element_by_id('botaoPesquisarCarf').click()

        # Blocks until the results table appears, or raises TimeoutException
        # after 10 seconds, so no extra check is needed afterwards.
        WebDriverWait(self.driver, 10).until(
            EC.presence_of_element_located((By.ID, 'tblJurisprudencia'))
        )

        response_selenium = Selector(text=self.driver.page_source)
        table = response_selenium.xpath("//table[@id='tblJurisprudencia']")
        for row in table.xpath('.//tr'):
            # Leading dot: './/div' searches only within this row; a bare
            # '//div' would search the whole document on every iteration.
            body = row.xpath(".//div[@class='rich-panel-body ']")
            yield {
                'acordao': body.xpath('./a/text()').extract_first()
            }
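One subtlety worth spelling out: in XPath, an expression starting with `//` searches the whole document even when it is evaluated on a sub-node, while `.//` restricts the search to the current node's subtree. The same distinction exists in the standard library's ElementTree, which makes for a small self-contained illustration (the HTML below is a made-up stand-in for the real results table):

```python
import xml.etree.ElementTree as ET

# Two tables: only one is the results table we care about.
doc = ET.fromstring(
    "<html><body>"
    "<table id='tblJurisprudencia'><tr><td>A</td></tr><tr><td>B</td></tr></table>"
    "<table id='other'><tr><td>C</td></tr></table>"
    "</body></html>"
)

# Select the results table, then search relative to it with './/'.
table = doc.find(".//table[@id='tblJurisprudencia']")
rows = table.findall(".//tr")
print(len(rows))  # 2 (only rows inside tblJurisprudencia, not the 3rd row)
```

The same rule applies to Scrapy/lxml selectors inside a loop: use relative paths (`.//` or `./`) on each row, otherwise every iteration re-selects the same document-wide node set.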
Upvotes: 1