AaronS

Reputation: 2335

Scrapy and redirection

I've been building a web scraper with Scrapy and have been going through some data validation to make sure that all the items are grabbed correctly.

I'm using it to grab Medium article data (title, author name, claps, responses, etc.), all of which is on one HTML page. To get the tags, though, I need to visit each individual article. I've managed to get most of the tags, but a couple of articles link to towardsdatascience.com instead of medium.com. Those links do a weird little redirect, where the link to the article is, for example:

https://towardsdatascience.com/cython-a-speed-up-tool-for-your-python-function-9bab64364bfd?source=tag_archive---------1-----------------------

It then redirects to: https://towardsdatascience.com/cython-a-speed-up-tool-for-your-python-function-9bab64364bfd

I've noticed that for articles that redirect to a towardsdatascience.com page, the spider doesn't grab the tags. The tag CSS selector is exactly the same as for the Medium articles it does scrape.

When I go into the Scrapy shell and try to fetch one of the articles that links to towardsdatascience.com, I get this response:

fetch("https://towardsdatascience.com/cython-a-speed-up-tool-for-your-python-function-9bab64364bfd?source=tag_archive---------1-----------------------")



**OUTPUT**

2020-02-16 11:52:31 [scrapy_user_agents.middlewares] DEBUG: Assigned User-Agent Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.93 Safari/537.36

2020-02-16 11:52:31 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://medium.com/m/global-identity?redirectUrl=https%3A%2F%2Ftowardsdatascience.com%2Fcython-a-speed-up-tool-for-your-python-function-9bab64364bfd%3Fsource%3Dtag_archive---------1-----------------------> from <GET https://towardsdatascience.com/cython-a-speed-up-tool-for-your-python-function-9bab64364bfd?source=tag_archive---------1----------------------->

2020-02-16 11:52:31 [scrapy.downloadermiddlewares.robotstxt] DEBUG: Forbidden by robots.txt: <GET https://medium.com/m/global-identity?redirectUrl=https%3A%2F%2Ftowardsdatascience.com%2Fcython-a-speed-up-tool-for-your-python-function-9bab64364bfd%3Fsource%3Dtag_archive---------1----------------------->

The robots.txt file is here

User-Agent: *
Disallow: /m/
Disallow: /me/
Disallow: /@me$
Disallow: /@me/
Disallow: /*/edit$
Disallow: /*/*/edit$
Disallow: /r/
Disallow: /t/
Disallow: /search?q$
Disallow: /search?q=
Allow: /_/
Allow: /_/api/users/*/meta
Allow: /_/api/users/*/profile/stream
Allow: /_/api/posts/*/responses
Allow: /_/api/posts/*/responsesStream
Allow: /_/api/posts/*/related
Sitemap: https://towardsdatascience.com/sitemap/sitemap.xml

I've tried a few ways of handling redirects in Scrapy that I found on this website, without any success. Here's the code for the actual crawler.

CODE

    import scrapy
    from dateutil.parser import parse
    from medium.items import MediumItem


    # A plain Spider is used because CrawlSpider reserves parse() for its own rule handling.
    class DataSpider(scrapy.Spider):

        name = 'data'
        allowed_domains = ['medium.com', 'towardsdatascience.com']
        start_urls = ['https://medium.com/tag/python/archive/2020/02/01']

        def parse(self, response):
            # One element per article preview on the archive page.
            articles = response.xpath(
                '//div[@class="postArticle postArticle--short js-postArticle '
                'js-trackPostPresentation js-trackPostScrolls"]')

            for article in articles:
                item = MediumItem()

                # Only previews that have a title are scraped.
                if article.css("div > h3::text").extract_first():
                    item['Title'] = article.css("div > h3::text").extract_first()

                    item['Name'] = article.xpath(
                        './/a[@class="ds-link ds-link--styleSubtle link link--darken '
                        'link--accent u-accentColor--textNormal '
                        'u-accentColor--textDarken"]/text()').extract_first()

                    item['Date'] = parse(article.css('time::text').extract_first()).date()

                    item['Read'] = article.css("span::attr(title)").extract_first()

                    item['Publication'] = article.xpath(
                        './/a[@class="ds-link ds-link--styleSubtle link--darken '
                        'link--accent u-accentColor--textNormal"]/text()').extract_first()

                    # Queried on the current article, not on the whole list of articles.
                    item['Claps'] = article.xpath(
                        './/button[@class="button button--chromeless '
                        'u-baseColor--buttonNormal js-multirecommendCountButton '
                        'u-disablePointerEvents"]/text()').extract_first()

                    item['Responses'] = article.xpath(
                        './/a[@class="button button--chromeless '
                        'u-baseColor--buttonNormal"]/text()').extract_first()

                    link = article.xpath(
                        './/a[@class="button button--smaller button--chromeless '
                        'u-baseColor--buttonNormal"]/@href').extract_first()

                    # Visit the article page to collect its tags, carrying the item along.
                    yield response.follow(link, callback=self.get_link, meta={'item': item})

        def get_link(self, response):
            item = response.meta['item']
            item['Tags'] = response.css("ul > li > a::text").getall()
            yield item

Any help to get the tags from those pages like the one linked would be great.

Upvotes: 0

Views: 816

Answers (1)

ThePyGuy

Reputation: 1035

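Thanks to @furas's comment. They have the correct answer, but I want the rep. :)

Your log shows what happens: the article URL is redirected (302) to https://medium.com/m/global-identity, and that path is disallowed by robots.txt (Disallow: /m/), so Scrapy drops the request before it ever reaches the article. Turning off robots.txt compliance lets the redirect go through: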

# settings.py

ROBOTSTXT_OBEY = False
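
If you want to see why the request was dropped, here is a rough sketch using the standard library's urllib.robotparser rather than Scrapy's own middleware. It assumes medium.com serves the same "Disallow: /m/" rule as the robots.txt quoted in the question, and the redirect URL's query string is shortened for readability.

```python
from urllib.robotparser import RobotFileParser

# Subset of the rules quoted in the question. Assumption: medium.com serves
# the same "Disallow: /m/" rule as the file shown there.
ROBOTS_TXT = """\
User-Agent: *
Disallow: /m/
Disallow: /me/
Disallow: /r/
Disallow: /t/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Redirect target from the Scrapy log (query string shortened for readability).
redirect_url = (
    "https://medium.com/m/global-identity"
    "?redirectUrl=https%3A%2F%2Ftowardsdatascience.com"
)
# Final article URL from the question.
article_url = (
    "https://towardsdatascience.com/"
    "cython-a-speed-up-tool-for-your-python-function-9bab64364bfd"
)

# The intermediate redirect lives under /m/, so it is disallowed;
# the article path itself matches no Disallow rule.
redirect_ok = parser.can_fetch("*", redirect_url)
article_ok = parser.can_fetch("*", article_url)

print("redirect allowed:", redirect_ok)  # False
print("article allowed:", article_ok)    # True
```

So the article itself is fine to fetch; only the intermediate hop is blocked. If you would rather not change this for the whole project, the same setting can be scoped to one spider with the custom_settings class attribute, e.g. custom_settings = {'ROBOTSTXT_OBEY': False}.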

Upvotes: 2
