Joe

Reputation: 601

Follow news links with scrapy

I am new to crawling and Scrapy. I am trying to extract some news from https://www.lacuarta.com/, but only the news that match the tag san-valentin.

The webpage shows just the titles with an image for each news item; if you want to read one you have to click on it and it will take you to the page of the story (https://www.lacuarta.com/etiqueta/san-valentin/).

So, I am thinking my steps are:

  1. Go to the page that matches the tag I want, in this case san-valentin
  2. Extract the urls from the news
  3. Go to the page of the news
  4. Extract the data I want

I already have the points 1 and 2:

import scrapy

class SpiderTags(scrapy.Spider):
    name = "SpiderTags"

    def start_requests(self):
        url = 'https://www.lacuarta.com/etiqueta/'
        tag = getattr(self, 'tag', None)
        if tag is not None:
            url = url + tag
        yield scrapy.Request(url, self.parse)

    def parse(self, response):
        for url in response.css("h4.normal a::attr(href)"):
            yield {
                "link": url.get()
            }

Up to here I have the links to the news. Now I can't figure out how to enter each news page to extract the data I want, and then return to my original page, go to page number 2, and repeat everything.

PS: I already know how to get the info I want from the story page.

Upvotes: 1

Views: 1358

Answers (2)

Sergey

Reputation: 57

import scrapy
from scrapy.spiders import CrawlSpider

class SpiderName(CrawlSpider):
    name = 'spidername'
    allowed_domains = ['lacuarta.com']
    start_urls = ['https://www.lacuarta.com/etiqueta/san-valentin/']

    def parse(self, response):
        for item in response.xpath('//article[@class="archive-article modulo-fila"]'):
            # maybe you need more data within `item`
            post_url = item.xpath('.//h4/a/@href').extract_first()
            yield response.follow(post_url, callback=self.post_parse)

        next_page = response.xpath('//li[@class="active"]/following-sibling::li/a/@href').extract_first()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

    def post_parse(self, response):
        title = response.xpath('//h1/text()').extract_first()
        story = response.xpath('//div[@id="ambideXtro"]/child::*').extract()
        author = response.xpath('//div[@class="col-sm-6 m-top-10"]/h4/a/text()').extract_first()
        date = response.xpath('//span[@class="ltpicto-calendar"]').extract_first()
        yield {'title': title, 'story': story, 'author': author, 'date': date}

Upvotes: 1

malberts

Reputation: 2536

You need to yield a new Request in order to follow the link. For example:

def parse(self, response):
    for url in response.css("h4.normal a::attr(href)"):
        # This will get the URL value, not follow it:
        # yield {
        #     "link": url.get()
        # }
        # This will follow the URL:
        yield scrapy.Request(url.get(), self.parse_news_item)

def parse_news_item(self, response):
    # Extract things from the news item page.
    yield {
        'Title': response.css("title::text").get(),
        'Story': response.css("div.col-md-11 p::text").getall(),
        'Author': response.css("div.col-sm-6 h4 a::text").getall(),
        'Date': response.css("div.col-sm-6 h4 small span::text").getall(),
    }

Upvotes: 3
