Reputation: 11
As part of learning Python, I am trying to scrape the news headlines from the mail.ru main page.
I have allowed crawling and added a custom user agent. I have tried several different XPath expressions, but I can't get anything, just an empty list.
import scrapy


class TestmailspidetSpider(scrapy.Spider):
    name = 'testmailspidet'
    allowed_domains = ['mail.ru']
    start_urls = ['http://mail.ru/']

    def parse(self, response):
        yield {
            'testing': response.xpath('//span[@class="i-link-deco i-inline"][position()=1]').extract_first()
        }
Upvotes: 1
Views: 483
Reputation: 2116
Crawling the page is forbidden by the site's robots.txt (https://mail.ru/robots.txt). If you still want to scrape it, you'll have to set ROBOTSTXT_OBEY to False. You can do this by adding the following class attribute to your spider:
custom_settings = {
    'ROBOTSTXT_OBEY': False,
}
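To see how a robots.txt rule ends up blocking a request, here is a minimal sketch using the standard library's urllib.robotparser. The robots.txt content below is a hypothetical sample for illustration, not a copy of mail.ru's actual file, and the helper name is_allowed and the user agent string are assumptions as well.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only; the real file
# at https://mail.ru/robots.txt may contain different rules.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # "my-crawler" is a made-up user agent name.
    print(is_allowed(SAMPLE_ROBOTS_TXT, "my-crawler", "http://mail.ru/"))
```

With ROBOTSTXT_OBEY enabled, Scrapy performs a comparable check and drops disallowed requests before parse is ever called, which would also leave you with no results.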
Furthermore, your XPath doesn't return any results, probably because the content is loaded dynamically. You can use scrapy shell to check what the HTML page that Scrapy sees actually looks like:
scrapy shell -s ROBOTSTXT_OBEY=False "http://mail.ru/"
The XPath that gets you the titles can then be constructed as follows:
//*[@id="news:main:list"]//*[@class="news__list__item__link__text"]/text()
Upvotes: 1