Reputation: 2459
I am trying to scrape a website that has info about 15 articles on each page. For each article, I'd like to get the title and date, and then follow the "Read More" link to get additional information (e.g., the source of the article).
So far, I have successfully scraped the title and date for each article on all pages and stored them in a CSV file. My problem is that I couldn't follow the "Read More" link to get the additional info (the source) for each article. I have read a lot of similar questions and their answers, but I haven't been able to fix it yet.
Here is my code:
import scrapy

class PoynterFakenewsSpider(scrapy.Spider):
    name = 'Poynter_FakeNews'
    allowed_domains = ['poynter.org']
    start_urls = ['https://www.poynter.org/ifcn-covid-19-misinformation//']
    custom_settings = {'FEED_URI': "crawlPoynter_%(time)s.csv", 'FEED_FORMAT': 'csv'}

    def parse(self, response):
        print("procesing:" + response.url)
        Title = response.xpath('//h2[@class="entry-title"]/a/text()').extract()
        Date = response.xpath('//p[@class="entry-content__text"]/strong/text()').extract()
        ReadMore_links = response.xpath('//a[@class="button entry-content__button entry-content__button--smaller"]/@href').extract()
        for link in ReadMore_links:
            yield scrapy.Request(response.urljoin(links, callback=self.parsepage2)

    def parsepage2(self, response):
        Source = response.xpath('//p[@class="entry-content__text entry-content__text--smaller"]/text()').extract_first()
        return Source

        row_data = zip(Title, Date, Source)
        for item in row_data:
            scraped_info = {
                'page': response.url,
                'Title': item[0],
                'Date': item[1],
                'Source': item[2],
            }
            yield scraped_info

        next_page = response.xpath('//a[@class="next page-numbers"]/@href').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
Upvotes: 0
Views: 371
Reputation: 67
You might want to have a look at the follow_all function; it is a better option than urljoin:
https://docs.scrapy.org/en/latest/intro/tutorial.html#more-examples-and-patterns
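As a minimal sketch (reusing the question's XPath and callback name, and assuming Scrapy 2.0+, where Response.follow_all was introduced), the manual urljoin loop in parse could become:

    def parse(self, response):
        # follow_all resolves relative hrefs against the current page
        # and yields one Request per matched link
        yield from response.follow_all(
            xpath='//a[@class="button entry-content__button entry-content__button--smaller"]/@href',
            callback=self.parsepage2,
        )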
Upvotes: 0
Reputation: 10666
You need to process each article, get its Date, Title, and "Read More" link, and then yield another scrapy.Request, passing the already collected information along using cb_kwargs (or request.meta in old versions):
import scrapy

class PoynterFakenewsSpider(scrapy.Spider):
    name = 'Poynter_FakeNews'
    allowed_domains = ['poynter.org']
    start_urls = ['https://www.poynter.org/ifcn-covid-19-misinformation//']
    custom_settings = {'FEED_URI': "crawlPoynter_%(time)s.csv", 'FEED_FORMAT': 'csv'}

    def parse(self, response):
        for article in response.xpath('//article'):
            Title = article.xpath('.//h2[@class="entry-title"]/a/text()').get()
            Date = article.xpath('.//p[@class="entry-content__text"]/strong/text()').get()
            ReadMore_link = article.xpath('.//a[@class="button entry-content__button entry-content__button--smaller"]/@href').get()
            yield scrapy.Request(
                url=response.urljoin(ReadMore_link),
                callback=self.parse_article_details,
                cb_kwargs={
                    'article_title': Title,
                    'article_date': Date,
                },
            )

        next_page = response.xpath('//a[@class="next page-numbers"]/@href').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)

    def parse_article_details(self, response, article_title, article_date):
        Source = response.xpath('//p[@class="entry-content__text entry-content__text--smaller"]/text()').extract_first()
        scraped_info = {
            'page': response.url,
            'Title': article_title,
            'Date': article_date,
            'Source': Source,
        }
        yield scraped_info
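For older Scrapy versions (before 1.7, where cb_kwargs was added), the same hand-off can be sketched with request.meta instead; the callback then reads the values back from response.meta:

        # inside parse: carry the collected data in meta instead of cb_kwargs
        yield scrapy.Request(
            url=response.urljoin(ReadMore_link),
            callback=self.parse_article_details,
            meta={'article_title': Title, 'article_date': Date},
        )

    def parse_article_details(self, response):
        # the values arrive on response.meta in the callback
        article_title = response.meta['article_title']
        article_date = response.meta['article_date']
        ...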
UPDATE: Everything works correctly on my side:
2020-05-14 00:59:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.poynter.org/?ifcn_misinformation=japanese-schools-re-opened-then-were-closed-again-due-to-a-second-wave-of-coronavirus>
{'page': 'https://www.poynter.org/?ifcn_misinformation=japanese-schools-re-opened-then-were-closed-again-due-to-a-second-wave-of-coronavirus', 'Title': ' Japanese schools re-opened then were closed again due to a second wave of coronavirus.', 'Date': '2020/05/12 | France', 'Source': "This false claim originated from: CGT Educ'Action", 'files': []}
2020-05-14 00:59:37 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.poynter.org/?ifcn_misinformation=famous-french-blue-cheese-roquefort-is-a-medecine-against-covid-19> (referer: https://www.poynter.org/ifcn-covid-19-misinformation/)
2020-05-14 00:59:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.poynter.org/?ifcn_misinformation=famous-french-blue-cheese-roquefort-is-a-medecine-against-covid-19>
{'page': 'https://www.poynter.org/?ifcn_misinformation=famous-french-blue-cheese-roquefort-is-a-medecine-against-covid-19', 'Title': ' Famous French blue cheese, roquefort, is a “medecine against Covid-19”.', 'Date': '2020/05/12 | France', 'Source': 'This false claim originated from: Facebook user', 'files': []}
2020-05-14 00:59:44 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.poynter.org/?ifcn_misinformation=administrative-documents-french-people-need-to-fill-to-go-out-are-a-copy-paste-from-1940-documents> (referer: https://www.poynter.org/ifcn-covid-19-misinformation/)
2020-05-14 00:59:44 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.poynter.org/?ifcn_misinformation=administrative-documents-french-people-need-to-fill-to-go-out-are-a-copy-paste-from-1940-documents>
{'page': 'https://www.poynter.org/?ifcn_misinformation=administrative-documents-french-people-need-to-fill-to-go-out-are-a-copy-paste-from-1940-documents', 'Title': ' Administrative documents French people need to fill to go out are a copy paste from 1940 documents.', 'Date': '2020/05/12 | France', 'Source': 'This false claim originated from: IndignezVous', 'files': []}
2020-05-14 00:59:51 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.poynter.org/?ifcn_misinformation=spanish-and-french-masks-prices-are-comparable> (referer: https://www.poynter.org/ifcn-covid-19-misinformation/)
2020-05-14 00:59:51 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.poynter.org/?ifcn_misinformation=spanish-and-french-masks-prices-are-comparable>
{'page': 'https://www.poynter.org/?ifcn_misinformation=spanish-and-french-masks-prices-are-comparable', 'Title': ' Spanish and French masks prices are comparable.', 'Date': '2020/05/12 | France', 'Source': 'This false claim originated from: Facebook user', 'files': []}
2020-05-14 00:59:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.poynter.org/?ifcn_misinformation=french-president-macron-and-its-spouse-are-jetskiing-during-the-lockdown> (referer: https://www.poynter.org/ifcn-covid-19-misinformation/)
2020-05-14 00:59:54 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.poynter.org/?ifcn_misinformation=french-president-macron-and-its-spouse-are-jetskiing-during-the-lockdown>
{'page': 'https://www.poynter.org/?ifcn_misinformation=french-president-macron-and-its-spouse-are-jetskiing-during-the-lockdown', 'Title': ' French President Macron and its spouse are jetskiing during the lockdown.', 'Date': '2020/05/12 | France', 'Source': 'This false claim originated from: Facebook user', 'files': []}
2020-05-14 00:59:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.poynter.org/?ifcn_misinformation=french-minister-of-justice-nicole-belloubet-threathened-the-famous-anchor-jean-pierre-pernaut-after-he-criticized-the-government-policy-about-the-pandemic-on-air> (referer: https://www.poynter.org/ifcn-covid-19-misinformation/)
2020-05-14 00:59:55 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.poynter.org/?ifcn_misinformation=french-minister-of-justice-nicole-belloubet-threathened-the-famous-anchor-jean-pierre-pernaut-after-he-criticized-the-government-policy-about-the-pandemic-on-air>
{'page': 'https://www.poynter.org/?ifcn_misinformation=french-minister-of-justice-nicole-belloubet-threathened-the-famous-anchor-jean-pierre-pernaut-after-he-criticized-the-government-policy-about-the-pandemic-on-air', 'Title': ' French Minister of Justice Nicole Belloubet threathened the famous anchor Jean-Pierre Pernaut after he criticized the government policy about the pandemic on air.', 'Date': '2020/05/12 | France', 'Source': 'This false claim originated from: Facebook user', 'files': []}
Upvotes: 2