Reputation: 103
I want to ask how about (do crawling) clicking next button(change number page of website) (then do crawling more till the end of page number) from this site
I've try to combining scrape with selenium,but its still error and says "line 22
self.driver = webdriver.Firefox()
^
IndentationError: expected an indented block"
I don't know why it happens, i think i code is so well.Anybody can resolve this problem?
This my source :
from selenium import webdriver
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from now.items import NowItem
class MySpider(BaseSpider):
name = "nowhere"
allowed_domains = ["n0where.net"]
start_urls = ["https://n0where.net/"]
def parse(self, response):
for article in response.css('.loop-panel'):
item = NowItem()
item['title'] = article.css('.article-title::text').extract_first()
item['link'] = article.css('.loop-panel>a::attr(href)').extract_first()
item['body'] ='' .join(article.css('.excerpt p::text').extract()).strip()
#item['date'] = article.css('[itemprop="datePublished"]::attr(content)').extract_first()
yield item
def __init__(self):
self.driver = webdriver.Firefox()
def parse2(self, response):
self.driver.get(response.url)
while True:
next = self.driver.find_element_by_xpath('/html/body/div[4]/div[3]/div/div/div/div/div[1]/div/div[6]/div/a[8]/span')
try:
next.click()
# get the data and write it to scrapy items
except:
break
self.driver.close()`
This my capture of my program mate :
Upvotes: 2
Views: 6579
Reputation: 21406
Ignoring the syntax and indentation errors you have an issue with your code logic in general.
What you do is create webdriver and never use it. What your spider does here is:
self.start_urls
, in your case it's only one.Response
object and pass it to the self.parse()
Your parse2 was never called and so your selenium webdriver was never used.
Since you are not using scrapy to download anything in this case you can just override start_requests()
(<- that's where your spider starts) method of your spider to do the whole logic.
Something like:
from selenium import webdriver
import scrapy
from scrapy import Selector
class MySpider(scrapy.Spider):
name = "nowhere"
allowed_domains = ["n0where.net"]
start_url = "https://n0where.net/"
def start_requests(self):
driver = webdriver.Firefox()
driver.get(self.start_url)
while True:
next_url = driver.find_element_by_xpath(
'/html/body/div[4]/div[3]/div/div/div/div/div[1]/div/div[6]/div/a[8]/span')
try:
# parse the body your webdriver has
self.parse(driver.page_source)
# click the button to go to next page
next_url.click()
except:
break
driver.close()
def parse(self, body):
# create Selector from html string
sel = Selector(text=body)
# parse it
for article in sel.css('.loop-panel'):
item = dict()
item['title'] = article.css('.article-title::text').extract_first()
item['link'] = article.css('.loop-panel>a::attr(href)').extract_first()
item['body'] = ''.join(article.css('.excerpt p::text').extract()).strip()
# item['date'] = article.css('[itemprop="datePublished"]::attr(content)').extract_first()
yield item
Upvotes: 5
Reputation: 9753
This is a indentation error. Look the lines near the error:
def parse2(self, response):
self.driver.get(response.url)
The first of these two lines ends with a colon. So, the second line should be more indented than the first one.
There are two possible fixes, depending on what you want to do. Either add an indentation level to the second one:
def parse2(self, response):
self.driver.get(response.url)
Or move the parse2 function out of the
init` function:
def parse2(self, response):
self.driver.get(response.url)
def __init__(self):
self.driver = webdriver.Firefox()
# etc.
Upvotes: 1