Reputation: 2335
So I've been building a web scraper with Scrapy and have been going through some data validation to make sure that all the items have been grabbed correctly.
I'm using it to grab Medium article data (title, author name, claps, responses, etc.), which is all on one HTML page. To get the tags, though, I need to visit each individual article. I've managed to get most of the tags, but a couple of articles link to towardsdatascience.com instead of the Medium website. There's a weird little redirect where the link to the article is, say, for example
It then redirects to: https://towardsdatascience.com/cython-a-speed-up-tool-for-your-python-function-9bab64364bfd
Now I've noticed that for the articles that redirect to the towardsdatascience page, it doesn't grab the tags. The tag CSS selector is exactly the same as for the other Medium articles, where it works.
When I go into the Scrapy shell and try to fetch one of the articles that links to a towardsdatascience article, I get this response.
fetch("https://towardsdatascience.com/cython-a-speed-up-tool-for-your-python-function-9bab64364bfd?source=tag_archive---------1-----------------------")
**OUTPUT**
2020-02-16 11:52:31 [scrapy_user_agents.middlewares] DEBUG: Assigned User-Agent Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.93 Safari/537.36
2020-02-16 11:52:31 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://medium.com/m/global-identity?redirectUrl=https%3A%2F%2Ftowardsdatascience.com%2Fcython-a-speed-up-tool-for-your-python-function-9bab64364bfd%3Fsource%3Dtag_archive---------1-----------------------> from <GET https://towardsdatascience.com/cython-a-speed-up-tool-for-your-python-function-9bab64364bfd?source=tag_archive---------1----------------------->
2020-02-16 11:52:31 [scrapy.downloadermiddlewares.robotstxt] DEBUG: Forbidden by robots.txt: <GET https://medium.com/m/global-identity?redirectUrl=https%3A%2F%2Ftowardsdatascience.com%2Fcython-a-speed-up-tool-for-your-python-function-9bab64364bfd%3Fsource%3Dtag_archive---------1----------------------->
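To make the redirect in that log easier to read, here is a small sketch that decodes the URL Scrapy was redirected to. It uses only the standard library; the variable names are my own and nothing here is part of the spider.

```python
from urllib.parse import parse_qs, urlsplit

# The redirect target taken from the Scrapy log above.
redirect = (
    "https://medium.com/m/global-identity?redirectUrl="
    "https%3A%2F%2Ftowardsdatascience.com%2Fcython-a-speed-up-tool-for-"
    "your-python-function-9bab64364bfd"
    "%3Fsource%3Dtag_archive---------1-----------------------"
)

parts = urlsplit(redirect)
# parse_qs percent-decodes the query values for us.
target = parse_qs(parts.query)["redirectUrl"][0]

print(parts.netloc, parts.path)  # host and path the request was bounced to
print(target)                    # the article URL wrapped inside the redirect
```

So the request leaves towardsdatascience.com, goes to a /m/ path on medium.com, and carries the original article URL along in the redirectUrl parameter.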
The robots.txt file is here
User-Agent: *
Disallow: /m/
Disallow: /me/
Disallow: /@me$
Disallow: /@me/
Disallow: /*/edit$
Disallow: /*/*/edit$
Disallow: /r/
Disallow: /t/
Disallow: /search?q$
Disallow: /search?q=
Allow: /_/
Allow: /_/api/users/*/meta
Allow: /_/api/users/*/profile/stream
Allow: /_/api/posts/*/responses
Allow: /_/api/posts/*/responsesStream
Allow: /_/api/posts/*/related
Sitemap: https://towardsdatascience.com/sitemap/sitemap.xml
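As a rough check of why the request gets dropped, the rules above can be fed to Python's standard-library robots.txt parser. This is only a sketch: it is not Scrapy's own robots.txt handling, it uses a subset of the rules, and it assumes medium.com serves equivalent rules to the file quoted here (which the "Forbidden by robots.txt" log line suggests).

```python
from urllib.robotparser import RobotFileParser

# Subset of the rules quoted above. Wildcard rules such as "/*/edit$" are
# left out because urllib.robotparser only does plain prefix matching.
ROBOTS_TXT = """\
User-Agent: *
Disallow: /m/
Disallow: /me/
Disallow: /r/
Disallow: /t/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Note: RobotFileParser only looks at the path, not the host.
redirect_url = (
    "https://medium.com/m/global-identity?redirectUrl="
    "https%3A%2F%2Ftowardsdatascience.com%2Fcython-a-speed-up-tool-for-"
    "your-python-function-9bab64364bfd"
)
article_url = (
    "https://towardsdatascience.com/"
    "cython-a-speed-up-tool-for-your-python-function-9bab64364bfd"
)

print(parser.can_fetch("*", redirect_url))  # the /m/ redirect hop
print(parser.can_fetch("*", article_url))   # the article itself
```

The article path is not disallowed; it is the intermediate /m/global-identity hop that falls under "Disallow: /m/".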
I've tried a few ways of handling redirects in Scrapy that I found on this website, without any success. Here's the code for the actual crawler.
CODE
import scrapy
from dateutil.parser import parse
from medium.items import MediumItem


# A plain Spider is enough here: no CrawlSpider rules are used, and the
# Scrapy docs advise against overriding parse() on a CrawlSpider.
class DataSpider(scrapy.Spider):
    name = 'data'
    allowed_domains = ['medium.com', 'towardsdatascience.com']
    start_urls = ['https://medium.com/tag/python/archive/2020/02/01']

    def parse(self, response):
        # One postArticle block per article on the archive page.
        articles = response.xpath('//div[@class="postArticle postArticle--short js-postArticle js-trackPostPresentation js-trackPostScrolls"]')
        for article in articles:
            item = MediumItem()
            if article.css("div > h3::text").extract_first():
                item['Title'] = article.css("div > h3::text").extract_first()
                item['Name'] = article.xpath('.//a[@class="ds-link ds-link--styleSubtle link link--darken link--accent u-accentColor--textNormal u-accentColor--textDarken"]/text()').extract_first()
                item['Date'] = parse(article.css('time::text').extract_first()).date()
                item['Read'] = article.css("span::attr(title)").extract_first()
                item['Publication'] = article.xpath('.//a[@class="ds-link ds-link--styleSubtle link--darken link--accent u-accentColor--textNormal"]/text()').extract_first()
                # Fixed: query the current article, not the whole `articles` list,
                # otherwise every item gets the first article's clap count.
                item['Claps'] = article.xpath('.//button[@class="button button--chromeless u-baseColor--buttonNormal js-multirecommendCountButton u-disablePointerEvents"]/text()').extract_first()
                item['Responses'] = article.xpath('.//a[@class="button button--chromeless u-baseColor--buttonNormal"]/text()').extract_first()
                link = article.xpath('.//a[@class="button button--smaller button--chromeless u-baseColor--buttonNormal"]/@href').extract_first()
                # Follow the article link to collect its tags.
                yield response.follow(link, callback=self.get_link, meta={'item': item})

    def get_link(self, response):
        item = response.meta['item']
        item['Tags'] = response.css("ul > li > a::text").getall()
        yield item
Any help to get the tags from those pages like the one linked would be great.
Upvotes: 0
Views: 816
Reputation: 1035
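Thanks to @furas's comment. They have the correct answer, but I want the rep. :) The log in the question shows the redirected request to medium.com/m/global-identity being dropped as "Forbidden by robots.txt", so telling Scrapy not to obey robots.txt lets the redirect go through.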
# settings.py
ROBOTSTXT_OBEY = False
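Since this switches off the robots.txt check for every request the project makes, it may be worth slowing the crawl down at the same time. A minimal sketch of what that could look like; ROBOTSTXT_OBEY is the actual fix, while the two throttling settings and their values are my own suggestion rather than part of the original answer.

```python
# settings.py (sketch)

# The actual fix: stop Scrapy from filtering requests against robots.txt.
ROBOTSTXT_OBEY = False

# Optional, assumed values: wait between requests and let Scrapy's
# AutoThrottle extension adapt the delay to the server's response times.
DOWNLOAD_DELAY = 1.0
AUTOTHROTTLE_ENABLED = True
```

If only this one spider needs the change, the same keys can go into the spider's custom_settings dict instead of the project-wide settings.py.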
Upvotes: 2