muhammadn

Reputation: 330

Processing a JSON response using Scrapy

I have the following code in my Scrapy spider:

def parse(self, response):
    jsonresponse = json.loads(response.body_as_unicode())
    htmldata = jsonresponse["html"]
    for sel in htmldata.xpath('//li/li'):
        -- more xpath codes --
    yield item

But I am getting this error:

    raise ValueError("No JSON object could be decoded")
exceptions.ValueError: No JSON object could be decoded

After checking the JSON reply, I found that it is wrapped in **<!--WPJM-->** and **<!--WPJM_END-->**, which is what causes this error.

<!--WPJM-->{"found_jobs":true,"html":"<html code>","max_num_pages":3}<!--WPJM_END-->

How do I parse the response in Scrapy while ignoring the <!--WPJM--> and <!--WPJM_END--> markers?

EDIT: This is the error that I have now:

  File "/home/muhammad/Projects/project/project/spiders/crawler.py", line 150, in parse
    for sel in htmldata.xpath('//li'):
exceptions.AttributeError: 'unicode' object has no attribute 'xpath'

    def parse(self, response):
        rawdata = response.body_as_unicode()
        jsondata = rawdata.replace('<!--WPJM-->', '').replace('<!--WPJM_END-->', '')
#       print jsondata # For debugging
#       pass 
        data = json.loads(jsondata)
        htmldata = data["html"]
#       print htmldata # For debugging
#       pass
        for sel in htmldata.xpath('//li'):
           item = ProjectjomkerjaItem()
           item['title'] = sel.xpath('a/div[@class="position"]/div[@id="job-title-job-listing"]/strong/text()').extract()
           item['company'] = sel.xpath('a/div[@class="position"]/div[@class="company"]/strong/text()').extract()
           item['link'] = sel.xpath('a/@href').extract()

Upvotes: 2

Views: 1402

Answers (1)

alecxe

Reputation: 473763

The easiest approach would be to get rid of the comment tags manually using replace():

data = response.body_as_unicode()
data = data.replace('<!--WPJM-->', '').replace('<!--WPJM_END-->', '')
jsonresponse = json.loads(data)

Though this is neither particularly Pythonic nor reliable.
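
If chained replace() calls feel too fragile, one alternative is to pull out whatever sits between the two markers with a regular expression. This is only a minimal sketch, assuming the markers always wrap a single JSON object; `extract_wpjm_json` and `WPJM_PAYLOAD` are made-up names, and the sample body is invented for illustration:

```python
import json
import re

# Marker strings are taken from the question; the names below are hypothetical.
WPJM_PAYLOAD = re.compile(r'<!--WPJM-->(.*)<!--WPJM_END-->', re.DOTALL)


def extract_wpjm_json(body):
    """Decode the JSON found between the WPJM markers.

    Falls back to decoding the whole body when the markers are absent.
    """
    match = WPJM_PAYLOAD.search(body)
    payload = match.group(1) if match else body
    return json.loads(payload)


# Invented sample body, shaped like the one in the question.
sample = ('<!--WPJM-->{"found_jobs":true,"html":"<ul><li>job</li></ul>",'
          '"max_num_pages":3}<!--WPJM_END-->')
data = extract_wpjm_json(sample)
print(data["max_num_pages"])
```

In a spider, the same helper could be called as `extract_wpjm_json(response.body_as_unicode())`.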

Or, a better option would be to get the text() via XPath:

$ scrapy shell index.html
>>> response.xpath('//text()').extract()[0]
u'{"found_jobs":true,"html":"<html code"}'

Upvotes: 1
