pruton

Reputation: 11

Getting None from Scrapy

As part of learning Python, I am trying to scrape the news headlines from the mail.ru main page.

I have allowed crawling and added a custom user agent. I have tried different XPath locations, but I can't get anything, just an empty list.

import scrapy

class TestmailspidetSpider(scrapy.Spider):
    name = 'testmailspidet'
    allowed_domains = ['mail.ru']
    start_urls = ['http://mail.ru/']

    def parse(self, response):
        # extract_first() returns None when the XPath matches nothing
        yield {
            'testing': response.xpath('//span[@class="i-link-deco i-inline"][position()=1]').extract_first()
        }

Upvotes: 1

Views: 483

Answers (1)

Wim Hermans

Reputation: 2116

Crawling is forbidden by robots.txt (https://mail.ru/robots.txt). If you still want to scrape the page, you'll have to set ROBOTSTXT_OBEY to False. You can do this per spider by adding a class attribute as follows:

custom_settings = {
    'ROBOTSTXT_OBEY': False,
}

Furthermore, the XPath in the question doesn't return any results, probably because that content is loaded dynamically. You can use scrapy shell to check what the HTML that Scrapy actually receives looks like:

scrapy shell -s ROBOTSTXT_OBEY=False "http://mail.ru/"

Based on that HTML, an XPath that gets you the titles can be constructed as follows:

//*[@id="news:main:list"]//*[@class="news__list__item__link__text"]/text()

Upvotes: 1
