Reputation: 390
Introduction
My crawler finally manages to log in, but it won't do any scraping and I can't find the reason for it. My console output shows no error. I wrote a crawler that follows every internal link a few weeks ago, so I thought I just had to build this crawler almost identically, but here I am. :>
My XPath expressions should be correct, because I started to learn crawling on this very domain.
I thought I had to choose CrawlSpider instead of scrapy.Spider, but if I change my rules to Linkextractor(r'/a-') (every product link contains that "a-" segment), I get an error, so I tried to go without rules.
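As an aside, the `allow` argument of Scrapy's `LinkExtractor` is a regular expression (and the class name is spelled `LinkExtractor`, capital E; a lowercase `Linkextractor` would raise a `NameError`). A quick way to sanity-check which URLs a pattern like `r'/a-'` would select is plain `re`; the URLs below are made-up examples, not real pages from the site:

```python
import re

# Hypothetical URLs in the style of the shop (assumed for illustration).
urls = [
    "https://www.topart-online.com/de/Blumen/a-12345",   # product-style link
    "https://www.topart-online.com/de/Login/p-Login",    # login page
    "https://www.topart-online.com/de/a-99887",          # product-style link
]

# Same regex you would pass as LinkExtractor(allow=r'/a-')
pattern = re.compile(r'/a-')
matches = [u for u in urls if pattern.search(u)]
print(matches)
```

Only the two URLs containing "/a-" survive the filter, which is the behaviour `LinkExtractor(allow=r'/a-')` would give for extracted links.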
My Code
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.http import Request, FormRequest
from scrapy.utils.response import open_in_browser
from scrapy.linkextractors import LinkExtractor
from ..items import ScrapyloginItem


class TopartLoginIn(CrawlSpider):
    name = "test123"
    allowed_domains = ['topart-online.com']
    login_page = 'https://www.topart-online.com/de/Login/p-Login'
    start_urls = ['https://www.topart-online.com/']

    rules = (
        Rule(
            LinkExtractor(),
            callback='parse_page',
            follow=True
        ),
    )

    def start_requests(self):
        yield Request(
            url=self.login_page,
            callback=self.login,
            dont_filter=True
        )

    def login(self, response):
        return FormRequest.from_response(
            response,
            formdata={
                'ff_4d4d375f4c6f67696e5f55736572': 'not real',
                'ff_4d4d375f4c6f67696e5f50617373': 'login data',
                'ff_4d4d375f4c6f67696e': ""
            },
            callback=self.after_loging)

    def after_loging(self, response):
        open_in_browser(response)
        accview = response.xpath('//div[@class="myaccounticons row text-center"]')
        if accview:
            print('success')
        else:
            print(':(')
            for url in self.start_urls:
                yield Request(url=url, callback=self.parse_page)

    def parse_page(self, response):
        productpage = response.xpath('//button[@class="btn btn-primary col-3 js-qty-up"]')
        for a in productpage:
            items = ScrapyloginItem()
            items['Title'] = response.xpath('//h1[@class="text-center text-md-left mt-0"]/text()').get()
            yield items
Here you can see that the login process succeeds, and that the page I get redirected to after login is also not a product link. That's exactly what I want; I'm just missing something so that it does the actual crawl process.
Upvotes: 1
Views: 231
Reputation: 2564
From what you described, your spider logs in, the after_loging method is called, and the var accview has some value, so it prints 'success' and ends there, because that's how your code is indented. Notice that the new requests are only yielded if your accview var is empty:
def after_loging(self, response):
    open_in_browser(response)
    accview = response.xpath('//div[@class="myaccounticons row text-center"]')
    if accview:
        print('success')
    else:
        print(':(')
        for url in self.start_urls:
            yield Request(url=url, callback=self.parse_page)
You probably wanted something like this:
def after_loging(self, response):
    open_in_browser(response)
    accview = response.xpath('//div[@class="myaccounticons row text-center"]')
    if accview:
        print('success')
    else:
        print(':(')
    for url in self.start_urls:  # Notice the indentation here
        yield Request(url=url, callback=self.parse_page)
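The effect of that indentation can be seen with a plain-Python sketch (the function names and values here are made up for illustration; a Scrapy callback is just a generator like these):

```python
def gen_yield_inside_else(logged_in):
    # yield sits inside the else branch: nothing is produced on success
    if logged_in:
        print('success')
    else:
        print(':(')
        for url in ['page-a', 'page-b']:
            yield url

def gen_yield_after_if(logged_in):
    # yield sits after the if/else: items are produced either way
    if logged_in:
        print('success')
    else:
        print(':(')
    for url in ['page-a', 'page-b']:
        yield url

print(list(gen_yield_inside_else(True)))  # []
print(list(gen_yield_after_if(True)))     # ['page-a', 'page-b']
```

With the first shape, a successful login means the callback yields no requests at all, so Scrapy has nothing left to crawl and the spider closes, which matches the behaviour described in the question.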
Upvotes: 1