dp Audiovisual

Reputation: 57

Scrapy list of links

I am building a spider with Scrapy. I want to follow every link in a list and then scrape the data on each linked page, but when I run the spider it doesn't scrape the data. What am I missing?

import scrapy
from scrapy.linkextractors import LinkExtractor

from ..items import JobscraperItem


class JobscraperSpider(scrapy.Spider):
    name = 'jobspider'
    start_urls = ['https://cccc/bolsa/ofertas?oferta=&lugar=&categoria=']

    def parse(self, response):
        job_detail = response.xpath('//div[@class="list"]/div/a')
        yield from response.follow_all(job_detail, self.parse_jobspider)

    def parse(self, response):
        items = JobscraperItem()

        job_title = response.xpath('//h1/text()').extract()
        company = response.xpath('//h2/b/text()').extract()
        company_url = response.xpath('//div[@class="pull-left"]/a/text()').extract()
        description = response.xpath('//div[@class="aviso"]/text()').extract()
        salary = response.xpath('//div[@id="aviso"]/p[1]/text()').extract()
        city = response.xpath('//div[@id="aviso"]/p[2]/text()').extract()
        district = response.xpath('//div[@id="aviso"]/p[5]/text()').extract()
        publication_date = response.xpath('//div[@id="publicado"]/text()').extract()
        apply = response.xpath('//p[@class="text-center"]/b/text()').extract()
        job_type = response.xpath('//div[@id="resumen"]/p[3]/text()').extract()

        items['job_title'] = job_title
        items['company'] = company
        items['company_url'] = company_url
        items['description'] = description
        items['salary'] = salary
        items['city'] = city
        items['district'] = district
        items['publication_date'] = publication_date
        items['apply'] = apply
        items['job_type'] = job_type

        yield items

Upvotes: 0

Views: 102

Answers (2)

dp Audiovisual

Reputation: 57

    rules = (
        Rule(LinkExtractor(allow=('/bolsa/166',)), follow=True, callback='parse_item'),
    )

I resolved this by adding the rules above, which make the spider follow every matching link and scrape the data inside each page.
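For anyone reusing this: `rules` are only processed by `scrapy.spiders.CrawlSpider`, so the spider class has to inherit from that rather than from `scrapy.Spider`, and the callback should not be named `parse`, because `CrawlSpider` uses `parse` internally. The `allow` entries are regular expressions checked against each absolute URL. As far as I know the check is a search rather than a full match, so the rough sketch below imitates it with the standard `re` module only. The `example.com` URLs are invented for illustration and are not taken from the real site.

```python
import re

# The pattern from the rule above, treated as a regular expression.
ALLOW_PATTERN = re.compile(r'/bolsa/166')

# Hypothetical URLs, made up purely to illustrate the matching.
candidate_urls = [
    'https://example.com/bolsa/166001/analista',
    'https://example.com/bolsa/ofertas?oferta=&lugar=&categoria=',
    'https://example.com/bolsa/1660',
    'https://example.com/empresas/166',
]


def allowed(url):
    """Return True when the pattern occurs anywhere in the URL."""
    return ALLOW_PATTERN.search(url) is not None


# Only the URLs that contain '/bolsa/166' would be followed.
followed = [url for url in candidate_urls if allowed(url)]
for url in followed:
    print(url)
```

Because the pattern is searched for rather than anchored, any URL containing `/bolsa/166` is accepted, so a tighter expression may be needed if other paths happen to share that prefix.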

Upvotes: 0

Cho'Gath

Reputation: 448

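From what I can see, one of the issues is that you define two methods called parse(). In Python the second definition replaces the first, so the parse() that follows the links never runs and the start page is handed straight to the detail parser. Since your first parse() passes self.parse_jobspider as the callback, I'm guessing the second method was meant to be named parse_jobspider.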
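The name clash can be reproduced without Scrapy at all. This is only a minimal sketch: the two classes and their return values are hypothetical stand-ins for the spider, not the real thing.

```python
class DuplicateParse:
    """Mimics the spider in the question: two methods share one name."""

    def parse(self, response):
        # Intended handler for the listing page.
        return 'listing'

    def parse(self, response):
        # Same name, so this definition silently replaces the one above.
        return 'detail'


class RenamedCallback:
    """Same structure, but the second method gets its own name."""

    def parse(self, response):
        return 'listing'

    def parse_jobspider(self, response):
        return 'detail'


# The start page reaches the detail handler in the first class
# and the listing handler in the second.
clashing_result = DuplicateParse().parse(None)
renamed_result = RenamedCallback().parse(None)

print(clashing_result)
print(renamed_result)
print(hasattr(DuplicateParse, 'parse_jobspider'))
```

In the question's spider, renaming the second method to parse_jobspider should let the first parse() handle the listing page and hand each job link to the detail parser.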

Also, are you sure that the URL in start_urls is correct? https://cccc/bolsa/ofertas?oferta=&lugar=&categoria= doesn't resolve to anything, which would also explain why no data is being scraped.
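If the host was simply masked for the post, ignore this. Otherwise, a quick way to see what Scrapy will actually try to contact is to split the URL with the standard library. The helper name below is my own, and the "has a dot" test is only a rough heuristic (names like localhost or an intranet host are valid without one).

```python
from urllib.parse import urlsplit


def looks_like_public_host(url):
    """Rough heuristic: a public hostname normally contains a dot."""
    host = urlsplit(url).hostname or ''
    return '.' in host


start_url = 'https://cccc/bolsa/ofertas?oferta=&lugar=&categoria='

# Extract the host part that the request would be sent to.
host = urlsplit(start_url).hostname
plausible = looks_like_public_host(start_url)

print(host)
print(plausible)
```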

Upvotes: 1
