mOna

Reputation: 2459

Scrapy: follow links to scrape additional information for each item

I am trying to scrape a website that has some info about 15 articles on each page. For each article, I'd like to get the title, date, and then follow the "Read More" link to get additional information (e.g., the source of the article).

So far, I have successfully scraped the title and date for each article on all pages and stored them in a CSV file.

My problem is that I couldn't follow the "Read More" link to get the additional info (source) for each article. I have read a lot of similar questions and their answers, but I could not fix it yet.

Here is my code:

import scrapy


class PoynterFakenewsSpider(scrapy.Spider):
    name = 'Poynter_FakeNews'
    allowed_domains = ['poynter.org']
    start_urls = ['https://www.poynter.org/ifcn-covid-19-misinformation//']

    custom_settings = {'FEED_URI': "crawlPoynter_%(time)s.csv", 'FEED_FORMAT': 'csv'}

    def parse(self, response):
        print("processing: " + response.url)
        Title = response.xpath('//h2[@class="entry-title"]/a/text()').extract()
        Date = response.xpath('//p[@class="entry-content__text"]/strong/text()').extract()

        ReadMore_links = response.xpath('//a[@class="button entry-content__button entry-content__button--smaller"]/@href').extract()
        for link in ReadMore_links:
            yield scrapy.Request(response.urljoin(link), callback=self.parsepage2)

        row_data = zip(Title, Date, Source)
        for item in row_data:
            scraped_info = {
                'page': response.url,
                'Title': item[0],
                'Date': item[1],
                'Source': item[2],
            }
            yield scraped_info

        next_page = response.xpath('//a[@class="next page-numbers"]/@href').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)

    def parsepage2(self, response):
        Source = response.xpath('//p[@class="entry-content__text entry-content__text--smaller"]/text()').extract_first()
        return Source

Upvotes: 0

Views: 371

Answers (2)

Calil

Reputation: 67

You might want to have a look at the follow_all method; it is a better option than building requests by hand with urljoin:

https://docs.scrapy.org/en/latest/intro/tutorial.html#more-examples-and-patterns

Upvotes: 0

gangabass

Reputation: 10666

You need to process each article individually: get the Date, Title and "Read More" link, and then yield another scrapy.Request, passing the already collected information along via cb_kwargs (or request.meta in older versions):

import scrapy


class PoynterFakenewsSpider(scrapy.Spider):
    name = 'Poynter_FakeNews'
    allowed_domains = ['poynter.org']
    start_urls = ['https://www.poynter.org/ifcn-covid-19-misinformation//']

    custom_settings = {'FEED_URI': "crawlPoynter_%(time)s.csv", 'FEED_FORMAT': 'csv'}

    def parse(self, response):

        for article in response.xpath('//article'):
            Title = article.xpath('.//h2[@class="entry-title"]/a/text()').get()
            Date = article.xpath('.//p[@class="entry-content__text"]/strong/text()').get()
            ReadMore_link = article.xpath('.//a[@class="button entry-content__button entry-content__button--smaller"]/@href').get()

            yield scrapy.Request(
                url=response.urljoin(ReadMore_link), 
                callback=self.parse_article_details,
                cb_kwargs={
                    'article_title': Title,
                    'article_date': Date,
                }
            )
        next_page = response.xpath('//a[@class="next page-numbers"]/@href').extract_first()
        if next_page: 
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)

    def parse_article_details(self, response, article_title, article_date):
        Source = response.xpath('//p[@class="entry-content__text entry-content__text--smaller"]/text()').extract_first()
        scraped_info = {
            'page':response.url,
            'Title': article_title, 
            'Date': article_date,
            'Source': Source,
        }
        yield scraped_info

UPDATE: Everything works correctly on my side:

2020-05-14 00:59:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.poynter.org/?ifcn_misinformation=japanese-schools-re-opened-then-were-closed-again-due-to-a-second-wave-of-coronavirus>
{'page': 'https://www.poynter.org/?ifcn_misinformation=japanese-schools-re-opened-then-were-closed-again-due-to-a-second-wave-of-coronavirus', 'Title': ' Japanese schools re-opened then were closed again due to a second wave of coronavirus.', 'Date': '2020/05/12 | France', 'Source': "This false claim originated from: CGT Educ'Action", 'files': []}
2020-05-14 00:59:37 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.poynter.org/?ifcn_misinformation=famous-french-blue-cheese-roquefort-is-a-medecine-against-covid-19> (referer: https://www.poynter.org/ifcn-covid-19-misinformation/)
2020-05-14 00:59:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.poynter.org/?ifcn_misinformation=famous-french-blue-cheese-roquefort-is-a-medecine-against-covid-19>
{'page': 'https://www.poynter.org/?ifcn_misinformation=famous-french-blue-cheese-roquefort-is-a-medecine-against-covid-19', 'Title': ' Famous French blue cheese, roquefort, is a “medecine against Covid-19”.', 'Date': '2020/05/12 | France', 'Source': 'This false claim originated from: Facebook user', 'files': []}
2020-05-14 00:59:44 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.poynter.org/?ifcn_misinformation=administrative-documents-french-people-need-to-fill-to-go-out-are-a-copy-paste-from-1940-documents> (referer: https://www.poynter.org/ifcn-covid-19-misinformation/)
2020-05-14 00:59:44 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.poynter.org/?ifcn_misinformation=administrative-documents-french-people-need-to-fill-to-go-out-are-a-copy-paste-from-1940-documents>
{'page': 'https://www.poynter.org/?ifcn_misinformation=administrative-documents-french-people-need-to-fill-to-go-out-are-a-copy-paste-from-1940-documents', 'Title': ' Administrative documents French people need to fill to go out are a copy paste from 1940 documents.', 'Date': '2020/05/12 | France', 'Source': 'This false claim originated from: IndignezVous', 'files': []}
2020-05-14 00:59:51 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.poynter.org/?ifcn_misinformation=spanish-and-french-masks-prices-are-comparable> (referer: https://www.poynter.org/ifcn-covid-19-misinformation/)
2020-05-14 00:59:51 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.poynter.org/?ifcn_misinformation=spanish-and-french-masks-prices-are-comparable>
{'page': 'https://www.poynter.org/?ifcn_misinformation=spanish-and-french-masks-prices-are-comparable', 'Title': ' Spanish and French masks prices are comparable.', 'Date': '2020/05/12 | France', 'Source': 'This false claim originated from: Facebook user', 'files': []}
2020-05-14 00:59:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.poynter.org/?ifcn_misinformation=french-president-macron-and-its-spouse-are-jetskiing-during-the-lockdown> (referer: https://www.poynter.org/ifcn-covid-19-misinformation/)
2020-05-14 00:59:54 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.poynter.org/?ifcn_misinformation=french-president-macron-and-its-spouse-are-jetskiing-during-the-lockdown>
{'page': 'https://www.poynter.org/?ifcn_misinformation=french-president-macron-and-its-spouse-are-jetskiing-during-the-lockdown', 'Title': ' French President Macron and its spouse are jetskiing during the lockdown.', 'Date': '2020/05/12 | France', 'Source': 'This false claim originated from: Facebook user', 'files': []}
2020-05-14 00:59:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.poynter.org/?ifcn_misinformation=french-minister-of-justice-nicole-belloubet-threathened-the-famous-anchor-jean-pierre-pernaut-after-he-criticized-the-government-policy-about-the-pandemic-on-air> (referer: https://www.poynter.org/ifcn-covid-19-misinformation/)
2020-05-14 00:59:55 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.poynter.org/?ifcn_misinformation=french-minister-of-justice-nicole-belloubet-threathened-the-famous-anchor-jean-pierre-pernaut-after-he-criticized-the-government-policy-about-the-pandemic-on-air>
{'page': 'https://www.poynter.org/?ifcn_misinformation=french-minister-of-justice-nicole-belloubet-threathened-the-famous-anchor-jean-pierre-pernaut-after-he-criticized-the-government-policy-about-the-pandemic-on-air', 'Title': ' French Minister of Justice Nicole Belloubet threathened the famous anchor Jean-Pierre Pernaut after he criticized the government policy about the pandemic on air.', 'Date': '2020/05/12 | France', 'Source': 'This false claim originated from: Facebook user', 'files': []}

Upvotes: 2
