kekw

Reputation: 390

Scrapy logs in, but doesn't crawl anything

Introduction

My crawler finally manages to log in, but it won't do any scraping and I can't find the reason. My console output shows no errors. A few weeks ago I wrote a crawler that follows every internal link, so I figured I just had to build this crawler almost identically, but yeah, here I am :>

My XPath expressions should be correct, because this is the domain I first "learned" crawling on.

I thought I had to choose CrawlSpider instead of scrapy.Spider, but if I change my rules to LinkExtractor(r'/a-') (every product link contains that "a-" fragment), I get an error, so I tried to go without rules.
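The `/a-` pattern itself can be checked in isolation with plain `re` before handing it to `LinkExtractor(allow=...)`. This is just a sketch; the example URLs below are made up, only the "a-" fragment comes from the description above:

```python
import re

# Hypothetical URLs -- only the "/a-" fragment is taken from the post.
product_url = "https://www.topart-online.com/de/Produkt/a-12345"
category_url = "https://www.topart-online.com/de/Kategorie"

# The same raw string that would be passed as LinkExtractor(allow=r'/a-')
pattern = re.compile(r"/a-")

print(bool(pattern.search(product_url)))   # True  -> link would be extracted
print(bool(pattern.search(category_url)))  # False -> link would be skipped
```

If the pattern behaves as expected here, the error is more likely in how the Rule is wired up than in the regex itself.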

My Code

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.http import Request, FormRequest
from scrapy.utils.response import open_in_browser
from scrapy.linkextractors import LinkExtractor
from ..items import ScrapyloginItem

class TopartLoginIn(CrawlSpider):
    name = "test123"
    allowed_domains = ['topart-online.com']
    login_page = 'https://www.topart-online.com/de/Login/p-Login'
    start_urls = ['https://www.topart-online.com/']
    
    rules = (
        Rule(
            LinkExtractor(),
            callback='parse_page',
            follow=True
        ),
    )
    
    def start_requests(self):
        yield Request(
            url=self.login_page,
            callback=self.login,
            dont_filter=True
        )
        
    def login(self, response):
        return FormRequest.from_response(
            response,
            formdata={
                'ff_4d4d375f4c6f67696e5f55736572' : 'not real',
                'ff_4d4d375f4c6f67696e5f50617373' : 'login data',
                'ff_4d4d375f4c6f67696e' : ""                
            },
            callback=self.parse_page)
        
    def after_loging(self, response):
        open_in_browser(response)
        accview = response.xpath('//div[@class="myaccounticons row text-center"]')
        
        if accview:
            print('success')
            
        else:
            print(':(')
            
            for url in self.start_urls:
                yield Request(url=url, callback=self.parse_page)
                        
    def parse_page(self, response):
        productpage = response.xpath('//button[@class="btn btn-primary col-3 js-qty-up"]')

        for a in productpage:
            
            items = ScrapyloginItem()
            items['Title'] = response.xpath('//h1[@class="text-center text-md-left mt-0"]/text()').get()
            yield items

Here you can see that the login process succeeded, and that the page I get referred to after login is also not a product link. That's exactly what I want; I'm just missing whatever makes it do the actual crawl process. console

Upvotes: 1

Views: 231

Answers (1)

renatodvc

Reputation: 2564

From what you described, your spider logs in, the after_loging method is called, the var accview has some value, so it prints 'success' and it ends there, because of how your code is indented.

Notice that the new requests are only yielded if your accview var is empty.

def after_loging(self, response):
    open_in_browser(response)
    accview = response.xpath('//div[@class="myaccounticons row text-center"]')
    
    if accview:
        print('success')
        
    else:
        print(':(')
        
        for url in self.start_urls:
            yield Request(url=url, callback=self.parse_page)

You probably wanted something like this:

def after_loging(self, response):
    open_in_browser(response)
    accview = response.xpath('//div[@class="myaccounticons row text-center"]')
    if accview:
        print('success')
    else:
        print(':(')

    for url in self.start_urls:  # Notice the indentation here
        yield Request(url=url, callback=self.parse_page)
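The effect of that indentation can be reproduced without Scrapy: in a generator, a `yield` nested inside the `else` branch only runs when the condition is false. A minimal sketch (made-up values, standing in for the `accview` check):

```python
def nested(found):
    # yield sits inside the else branch, as in the original spider
    if found:
        print('success')
    else:
        print(':(')
        for url in ['https://www.topart-online.com/']:
            yield url

def dedented(found):
    # yield sits after the if/else, as in the corrected version
    if found:
        print('success')
    else:
        print(':(')
    for url in ['https://www.topart-online.com/']:
        yield url

print(list(nested(True)))    # [] -- login succeeded, so nothing is yielded
print(list(dedented(True)))  # ['https://www.topart-online.com/']
```

With the nested version, a successful login means no requests are ever scheduled, which matches the "logs in but doesn't crawl" symptom.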

Upvotes: 1
