Vartika Singh

Reputation: 25

DEBUG: Crawled (404) <GET https://www.kaggle.com/competitions/>

I'm trying to extract data about the various competitions offered by Kaggle.

I have tried fetching the page from the Scrapy shell as well as from the spider code, and it fails both ways. I tried adding HTTPERROR_ALLOWED_CODES = [404] to settings.py and setting ROBOTSTXT_OBEY = False, yet the error did not go away.
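For reference, those two attempts correspond to the following lines in settings.py (a sketch of what was tried; neither setting resolved the 404):

# settings.py
HTTPERROR_ALLOWED_CODES = [404]  # let 404 responses through to the spider instead of dropping them
ROBOTSTXT_OBEY = False           # do not fetch or obey robots.txt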

Here is the spider:

# -*- coding: utf-8 -*-
import scrapy

class KaggleSpider(scrapy.Spider):
    name = 'kaggle'
    # allowed_domains should contain domains only, not URL paths
    allowed_domains = ['www.kaggle.com']
    start_urls = ['https://www.kaggle.com/competitions/']

    def parse(self, response):
        # Extract the content using CSS selectors
        titles = response.css('.sc-hpbwTc::text').extract()
        descriptions = response.css('.sc-ekLiME::text').extract()
        rewards = response.css('.sc-jWgUIs::text').extract()
        print(titles)

        # Give the extracted content row-wise
        for title, description, reward in zip(titles, descriptions, rewards):
            # Create a dictionary to store the scraped info
            scraped_info = {
                'title': title,
                'description': description,
                'reward': reward,
            }

            # Yield the scraped info to Scrapy
            yield scraped_info

C:\Users\Vartika Singh\ourfirstscraper1>scrapy crawl kaggle
2019-05-20 00:16:07 [scrapy.utils.log] INFO: Scrapy 1.6.0 started (bot: ourfirstscraper1)
2019-05-20 00:16:07 [scrapy.utils.log] INFO: Versions: lxml 4.3.3.0, libxml2 2.9.5, cssselect 1.0.3, parsel 1.5.1, w3lib 1.20.0, Twisted 19.2.0, Python 3.6.5 (v3.6.5:f59c0932b4, Mar 28 2018, 17:00:18) [MSC v.1900 64 bit (AMD64)], pyOpenSSL 19.0.0 (OpenSSL 1.1.1b  26 Feb 2019), cryptography 2.6.1, Platform Windows-10-10.0.17763-SP0
2019-05-20 00:16:07 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'ourfirstscraper1', 'NEWSPIDER_MODULE': 'ourfirstscraper1.spiders', 'SPIDER_MODULES': ['ourfirstscraper1.spiders']}
2019-05-20 00:16:07 [scrapy.extensions.telnet] INFO: Telnet Password: 397df34cf4a967c1
2019-05-20 00:16:07 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2019-05-20 00:16:07 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-05-20 00:16:07 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-05-20 00:16:07 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-05-20 00:16:07 [scrapy.core.engine] INFO: Spider opened
2019-05-20 00:16:07 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-05-20 00:16:07 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-05-20 00:16:09 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://www.kaggle.com/competitions/> (referer: None)
[]
2019-05-20 00:16:09 [scrapy.core.engine] INFO: Closing spider (finished)
2019-05-20 00:16:09 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 226,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 285,
 'downloader/response_count': 1,
 'downloader/response_status_count/404': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2019, 5, 19, 18, 46, 9, 403376),
 'log_count/DEBUG': 1,
 'log_count/INFO': 9,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2019, 5, 19, 18, 46, 7, 962228)}
2019-05-20 00:16:09 [scrapy.core.engine] INFO: Spider closed (finished)

Upvotes: 2

Views: 2640

Answers (2)

Thiago Curvelo

Reputation: 3740

To work around the 404, setting a user agent is enough. You can do that in settings.py or in the spider itself:

custom_settings = {
    'USER_AGENT': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:66.0) Gecko/20100101 Firefox/66.0',
}
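In context, that setting sits at the top of the spider class (a sketch based on the spider from the question; custom_settings overrides settings.py for this spider only):

class KaggleSpider(scrapy.Spider):
    name = 'kaggle'
    # Per-spider override of Scrapy's default user agent
    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:66.0) Gecko/20100101 Firefox/66.0',
    }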

Besides that, you won't be able to scrape the competitions with the selectors you have. Those elements are created dynamically by JavaScript after the page loads. However, the data you want is available in a <script> tag, and you can recover it with a regular expression via .re_first(). For example:

import json  # needed at the top of the spider module

def parse(self, response):
    # Pull the JSON payload out of the Kaggle.State.push(...) script tag
    data = json.loads(
        response
        .css(r"script:contains('Kaggle.State.push({\"')")
        .re_first(r'Kaggle.State.push\((.+?)\);')
    )

    for group in data['fullCompetitionGroups']:
        if group['totalCompetitions'] > 0:
            for competition in group['competitions']:
                yield {
                    'title': competition['competitionTitle'],
                    'description': competition['competitionDescription'],
                    'reward': competition['rewardDisplay'],
                }
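You can sanity-check the extraction in the Scrapy shell before wiring it into the spider (a sketch; the Kaggle.State key names are taken from the code above and may change as the site evolves):

# scrapy shell -s USER_AGENT="Mozilla/5.0 ..." "https://www.kaggle.com/competitions/"
import json

raw = (
    response
    .css(r"script:contains('Kaggle.State.push({\"')")
    .re_first(r'Kaggle.State.push\((.+?)\);')
)
data = json.loads(raw)
print(len(data['fullCompetitionGroups']))  # number of competition groups found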

Upvotes: 1

Pasindu Gamarachchi

Reputation: 566

Changing the user agent in settings.py works as well:

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:39.0) Gecko/20100101 Firefox/39.0'
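The same override can also be applied for a single run from the command line, using Scrapy's -s setting override:

scrapy crawl kaggle -s USER_AGENT="Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:39.0) Gecko/20100101 Firefox/39.0"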

Upvotes: 0
