Vincent

Reputation: 111

Scrapy-crawled-200 Referer-None

I'm trying to learn how to use Scrapy and Python, but I'm not an expert at all...
I get an empty file after crawling this page:

so.news.cn, and I don't understand why...

Here is my code:

import scrapy

class XinhuaSpider(scrapy.Spider):
    name = 'xinhua'
    allowed_domains = ['xinhuanet.com']
    start_urls = ['http://so.news.cn/?keyWordAll=&keyWordOne=%E6%96%B0%E5%86%A0+%E8%82%BA%E7%82%8E+%E6%AD%A6%E6%B1%89+%E7%97%85%E6%AF%92&keyWordIg=&searchFields=1&sortField=0&url=&senSearch=1&lang=cn#search/0/%E6%96%B0%E5%86%A0/1/']

    def parse(self, response):
        #titles = response.css('#newsCon > div.newsList > div.news > h2 > a::text').extract()
        #date = response.css('#newsCon > div.newsList > div.news > div > p.newstime > span::text').extract()
        titles = response.xpath("/html/body/div[@id='search-result']/div[@class='resultCnt']/div[@id='resultList']/div[@class='newsListCnt secondlist']/div[@id='newsCon']/div[@class='newsList']/div[@class='news']/h2/a/text()").extract()
        date = response.xpath("/html/body/div[@id='search-result']/div[@class='resultCnt']/div[@id='resultList']/div[@class='newsListCnt secondlist']/div[@id='newsCon']/div[@class='newsList']/div[@class='news']/div[@class='easynews']/p[@class='newstime']/span/text()").extract()
        for item in zip(titles, date):
            scraped_info = {
                "title": item[0],
                "date": item[1],
            }
            yield scraped_info

        nextPg = response.xpath("/html/body/div[@id='search-result']/div[@class='resultCnt']/div[@id='pagination']/a[@class='next']/@href").extract()
        if nextPg is not None:
            print(nextPg)

This is the message in the console:

2020-05-11 00:09:36 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://so.news.cn/?keyWordAll=&keyWordOne=%E6%96%B0%E5%86%A0+%E8%82%BA%E7%82%8E+%E6%AD%A6%E6%B1%89+%E7%97%85%E6%AF%92&keyWordIg=&searchFields=1&sortField=0&url=&senSearch=1&lang=cn#search/0/%E6%96%B0%E5%86%A0/1/> (referer: None)
[]

Upvotes: 0

Views: 403

Answers (1)

gangabass

Reputation: 10666

You should always check the page's source code (Ctrl+U) in your browser. The content you see in the browser may be loaded by an XHR (JavaScript) call, in which case it won't be in the HTML that Scrapy downloads. Here is code that works for me (I found the correct start URL using the Chrome Developer Console):

import scrapy
import json
import re

class XinhuaSpider(scrapy.Spider):
    name = 'xinhua'
    # allowed_domains = ['xinhuanet.com']
    start_urls = ['http://so.news.cn/getNews?keyWordAll=&keyWordOne=%25E6%2596%25B0%25E5%2586%25A0%2B%25E8%2582%25BA%25E7%2582%258E%2B%25E6%25AD%25A6%25E6%25B1%2589%2B%25E7%2597%2585%25E6%25AF%2592&keyWordIg=&searchFields=1&sortField=0&url=&senSearch=1&lang=cn&keyword=%E6%96%B0%E5%86%A0&curPage=1']

    def parse(self, response):
        data = json.loads(response.body)
        for item in data["content"]["results"]:
            scraped_info = {
                "title": item['title'],
                "date": item['pubtime'],
            }
            yield scraped_info

        current_page = data['content']['curPage']
        total_pages = data['content']['pageCount']
        if current_page < total_pages:
            next_page = re.sub(r'curPage=\d+', f"curPage={current_page + 1}", response.url)
            yield scrapy.Request(
                url=next_page,
                callback=self.parse,
            )

Upvotes: 1
