Daniel Borysowski

Reputation: 133

Scrapy spider returns no items data

My Scrapy spider does not seem to follow the links it extracts, so it never scrapes data from those pages (and never yields any Scrapy items).

I am trying to scrape a lot of data from a news website. I copied/adapted a spider that, as I understood it, should read links from a file (generated with another script), put them into the start_urls list, follow those links, extract some data and yield it as items, and also write each item's data to a separate file (that last part is really a separate question).

After running scrapy crawl PNS, the spider goes through all the links from start_urls but does nothing more: it requests each of them (I see "GET <link>" messages in the console), but it does not seem to follow any of the deeper links found on those pages or extract any data from them.

import scrapy
import re
from ProjectName.items import ProjectNameArticle

class ProjectNameSpider(scrapy.Spider):

    name = 'PNS'

    allowed_domains = ['www.project-domain.com']

    start_urls = []

    with open('start_urls.txt', 'r') as file:
        for line in file:
            start_urls.append(line.strip())

    def parse(self, response):
        for link in response.css('div.news-wrapper_ h3.b-item__title a').xpath('@href').extract():
            # extracted links look like this: "/document.html"
            link = "https://project-domain.com" + link
            yield scrapy.Request(link, callback=self.parse_news)

    def parse_news(self, response):

        data_dic = ProjectNameArticle()

        data_dic['article_date'] = response.css('div.article__date::text').extract_first().strip()
        data_dic['article_time'] = response.css('span.article__time::text').extract_first().strip()
        data_dic['article_title'] = response.css('h3.article__title::text').extract_first().strip()
        news_text = response.css('div.article__text').extract_first()
        news_text = re.sub(r'(<script(\s|\S)*?<\/script>)|(<style(\s|\S)*?<\/style>)|(<!--(\s|\S)*?-->)|(<\/?(\s|\S)*?>)', '', news_text).strip()
        data_dic['article_text'] = news_text
        return data_dic
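(Side note on the link building: concatenating "https://project-domain.com" with the extracted path drops the www. prefix listed in allowed_domains, so the deeper requests can be dropped by Scrapy's offsite filtering. The standard library's urljoin resolves a relative link against the page URL and keeps the original host; a minimal sketch, with project-domain.com standing in for the real site:)

```python
from urllib.parse import urljoin

# Resolving an extracted relative link ("/document.html") against the
# page URL keeps the original scheme and "www." host intact.
page_url = "https://www.project-domain.com/news/index.html"
link = "/document.html"
print(urljoin(page_url, link))  # https://www.project-domain.com/document.html
```

(Inside a Scrapy callback, response.urljoin(link) does the same resolution against the current response's URL.)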

Expected result:

  1. The script opens the start_urls.txt file, reads its lines (each line contains a single link), and puts those links into the start_urls list,
  2. For each opened link the spider extracts deeper links to follow (about 50-200 links for each start_urls link),
  3. The followed links are the main target, from which I want to extract specific data: article title, date, time, and text.
  4. For now, never mind writing each Scrapy item to a distinct .txt file.

Actual result:

  1. Running my spider triggers a GET for each start_urls link (it goes through around 150,000 of them), but it neither builds a list of deeper links nor visits them to extract any data.

Upvotes: 0

Views: 594

Answers (1)

Umair Ayub

Reputation: 21201

Dude, I have been coding in Python Scrapy for a long time, and I hate using start_urls.

You can simply use start_requests instead, which is very easy to read and also very easy for beginners to learn:

import re

import scrapy
from scrapy import Request

from ProjectName.items import ProjectNameArticle


class ProjectNameSpider(scrapy.Spider):

    name = 'PNS'

    # Use the bare domain so that the links built as
    # "https://project-domain.com/..." below are not dropped
    # by the offsite middleware.
    allowed_domains = ['project-domain.com']

    def start_requests(self):
        with open('start_urls.txt', 'r') as file:
            for line in file:
                yield Request(line.strip(),
                    callback=self.my_callback_func)

    def my_callback_func(self, response):
        for link in response.css('div.news-wrapper_ h3.b-item__title a').xpath('@href').extract():
            # extracted links look like this: "/document.html"
            link = "https://project-domain.com" + link
            yield scrapy.Request(link, callback=self.parse_news)

    def parse_news(self, response):
        data_dic = ProjectNameArticle()

        data_dic['article_date'] = response.css('div.article__date::text').extract_first().strip()
        data_dic['article_time'] = response.css('span.article__time::text').extract_first().strip()
        data_dic['article_title'] = response.css('h3.article__title::text').extract_first().strip()
        news_text = response.css('div.article__text').extract_first()
        news_text = re.sub(r'(<script(\s|\S)*?<\/script>)|(<style(\s|\S)*?<\/style>)|(<!--(\s|\S)*?-->)|(<\/?(\s|\S)*?>)', '', news_text).strip()
        data_dic['article_text'] = news_text
        return data_dic

I have also never used the Item class and don't find it necessary.

You can simply use data_dic = {} instead of data_dic = ProjectNameArticle().
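Using a plain dict also makes the asker's follow-up goal of writing each item to its own file straightforward; a stdlib-only sketch, where the field values are placeholders for the scraped data and the file-naming scheme is a hypothetical choice:

```python
import json
from pathlib import Path

# Hypothetical scraped item as a plain dict; the keys mirror the
# ProjectNameArticle fields, the values are placeholders.
item = {
    'article_date': '30.01.2019',
    'article_time': '12:00',
    'article_title': 'Example headline',
    'article_text': 'Body of the article...',
}

# Build a file name from the title and dump the dict as JSON.
path = Path(item['article_title'].replace(' ', '_') + '.json')
path.write_text(json.dumps(item, ensure_ascii=False, indent=2))

print(path.name)  # Example_headline.json
```

In a real spider this would live in an item pipeline (or the parse callback itself), with the dict yielded by parse_news taking the place of the hard-coded item.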

Upvotes: 3
