beboy

Reputation: 103

Click a button on a website using Scrapy

I want to crawl this site, click the "next" button to move to the next page, and keep crawling until the last page number is reached.

I tried combining Scrapy with Selenium, but it still fails with the error "line 22 self.driver = webdriver.Firefox() ^ IndentationError: expected an indented block".

I don't know why this happens; as far as I can tell my code is fine. Can anybody help me resolve this problem?

This is my source:

from selenium import webdriver
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from now.items import NowItem
class MySpider(BaseSpider):
name = "nowhere"
allowed_domains = ["n0where.net"]
start_urls = ["https://n0where.net/"]

def parse(self, response):
    for article in response.css('.loop-panel'):
        item = NowItem()
        item['title'] = article.css('.article-title::text').extract_first()
        item['link'] = article.css('.loop-panel>a::attr(href)').extract_first()
        item['body'] ='' .join(article.css('.excerpt p::text').extract()).strip()
        #item['date'] = article.css('[itemprop="datePublished"]::attr(content)').extract_first()
        yield item

def __init__(self):
    self.driver = webdriver.Firefox()

    def parse2(self, response):
    self.driver.get(response.url)

    while True:
        next = self.driver.find_element_by_xpath('/html/body/div[4]/div[3]/div/div/div/div/div[1]/div/div[6]/div/a[8]/span')

        try:
            next.click()

            # get the data and write it to scrapy items
        except:
            break

    self.driver.close()

Here is a screenshot of my program: [image: capture program]

Upvotes: 2

Views: 6579

Answers (2)

Granitosaurus

Reputation: 21406

Ignoring the syntax and indentation errors, there is a problem with your code's logic in general.

You create a webdriver and never use it. What your spider does here is:

  1. Create a webdriver object.
  2. Schedule a request for every URL in self.start_urls; in your case there is only one.
  3. Download it, build a Response object and pass it to self.parse().
  4. Your parse method runs some selectors and builds items, so Scrapy yields whatever items were found, if any.
  5. Done.

Your parse2 is never called, so your Selenium webdriver is never used.
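To make the flow above concrete, here is a toy sketch in plain Python. ToySpider and run() are hypothetical stand-ins written for illustration, not Scrapy's actual engine; the only point is that a method nobody registers as a callback never runs.

```python
class ToySpider:
    """Stand-in for a spider; this is NOT Scrapy's real machinery."""
    start_urls = ["https://n0where.net/"]

    def __init__(self):
        self.called = []  # records which methods actually ran

    def parse(self, response):
        self.called.append("parse")
        yield {"url": response}

    def parse2(self, response):
        self.called.append("parse2")
        yield {"url": response}


def run(spider):
    """Hypothetical mini-engine: every start URL goes to spider.parse only."""
    items = []
    for url in spider.start_urls:
        # parse() is the default callback; no other method is invoked
        # unless some request names it explicitly.
        items.extend(spider.parse(url))
    return items


spider = ToySpider()
items = run(spider)
print(spider.called)
```

In this sketch parse2 exists on the class but is never reached, which mirrors what happens to the webdriver code in the question.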

Since you are not using Scrapy to download anything in this case, you can just override your spider's start_requests() method (that's where your spider starts) and put the whole logic there.

Something like:

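from selenium import webdriver
from selenium.common.exceptions import WebDriverException
from selenium.webdriver.common.by import By
import scrapy
from scrapy import Selector


class MySpider(scrapy.Spider):
    name = "nowhere"
    allowed_domains = ["n0where.net"]
    start_url = "https://n0where.net/"

    def start_requests(self):
        driver = webdriver.Firefox()
        try:
            driver.get(self.start_url)
            while True:
                # parse the body your webdriver has; parse() is a generator,
                # so it has to be iterated or it never runs.
                # Assumption: your Scrapy version accepts items yielded from
                # start_requests(); otherwise move this loop into a callback.
                yield from self.parse(driver.page_source)
                try:
                    # click the button to go to next page
                    driver.find_element(
                        By.XPATH,
                        '/html/body/div[4]/div[3]/div/div/div/div/div[1]/div/div[6]/div/a[8]/span').click()
                except WebDriverException:
                    # no next button, or it cannot be clicked: last page
                    break
        finally:
            # always shut the browser down, even if something failed
            driver.quit()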

    def parse(self, body):
        # create Selector from html string
        sel = Selector(text=body)
        # parse it
        for article in sel.css('.loop-panel'):
            item = dict()
            item['title'] = article.css('.article-title::text').extract_first()
            item['link'] = article.css('.loop-panel>a::attr(href)').extract_first()
            item['body'] = ''.join(article.css('.excerpt p::text').extract()).strip()
            # item['date'] = article.css('[itemprop="datePublished"]::attr(content)').extract_first()
            yield item
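One Python detail matters for any parse method written with yield: calling it only creates a generator object, and the body runs when something iterates over it. A minimal sketch in plain Python, where parse() and start() are hypothetical stand-ins rather than the spider methods themselves:

```python
log = []  # records when the generator body actually executes


def parse(body):
    """Generator shaped like a parse method: one item per word."""
    log.append("parse ran")
    for word in body.split():
        yield {"title": word}


# Calling the function alone does not execute its body.
parse("first second")
ran_after_call = list(log)

# Iterating over the generator runs it and produces the items.
items = list(parse("first second"))
ran_after_iteration = list(log)


def start():
    """Forward everything parse() yields to whoever iterates start()."""
    yield from parse("third")


forwarded = list(start())
print(ran_after_call, items, forwarded)
```

So a caller that wants the items has to loop over the result or forward it with yield from.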

Upvotes: 5

Valentin Lorentz

Reputation: 9753

This is an indentation error. Look at the lines near the error:

    def parse2(self, response):
    self.driver.get(response.url)

The first of these two lines ends with a colon. So, the second line should be more indented than the first one.

There are two possible fixes, depending on what you want to do. Either add an indentation level to the second one:

    def parse2(self, response):
        self.driver.get(response.url)

Or move the parse2 function out of the __init__ function:

def parse2(self, response):
    self.driver.get(response.url)

def __init__(self):
    self.driver = webdriver.Firefox()

    # etc.
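If you want to check this without running the whole spider, the built-in compile() function reports the same error. This is a small sketch; the check() helper is a hypothetical name used only for illustration:

```python
# The two lines from the question, with and without the extra indentation.
broken = "def parse2(self, response):\nself.driver.get(response.url)\n"
fixed = "def parse2(self, response):\n    self.driver.get(response.url)\n"


def check(source):
    """Return None if source compiles, else the name of the exception class."""
    try:
        compile(source, "<snippet>", "exec")
    except SyntaxError as exc:  # IndentationError is a subclass of SyntaxError
        return type(exc).__name__
    return None


print(check(broken))
print(check(fixed))
```

compile() only parses the text, so it does not matter that self.driver is undefined here.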

Upvotes: 1
