Reputation: 330
I have the following code in my scrapy spider:
def parse(self, response):
    jsonresponse = json.loads(response.body_as_unicode())
    htmldata = jsonresponse["html"]
    for sel in htmldata.xpath('//li/li'):
        -- more xpath codes --
        yield item
But I am getting this error:
raise ValueError("No JSON object could be decoded")
exceptions.ValueError: No JSON object could be decoded
After checking the JSON reply, I found that it is wrapped in **<!--WPJM-->** and **<!--WPJM_END-->** comment tags, which are causing this error.
<!--WPJM-->{"found_jobs":true,"html":"<html code>","max_num_pages":3}<!--WPJM_END-->
How do I parse the response in my Scrapy spider while ignoring the <!--WPJM--> and <!--WPJM_END--> tags?
EDIT: This is the error that I get now:
File "/home/muhammad/Projects/project/project/spiders/crawler.py", line 150, in parse
    for sel in htmldata.xpath('//li'):
exceptions.AttributeError: 'unicode' object has no attribute 'xpath'
def parse(self, response):
    rawdata = response.body_as_unicode()
    jsondata = rawdata.replace('<!--WPJM-->', '').replace('<!--WPJM_END-->', '')
    # print jsondata # For debugging
    # pass
    data = json.loads(jsondata)
    htmldata = data["html"]
    # print htmldata # For debugging
    # pass
    for sel in htmldata.xpath('//li'):
        item = ProjectjomkerjaItem()
        item['title'] = sel.xpath('a/div[@class="position"]/div[@id="job-title-job-listing"]/strong/text()').extract()
        item['company'] = sel.xpath('a/div[@class="position"]/div[@class="company"]/strong/text()').extract()
        item['link'] = sel.xpath('a/@href').extract()
Upvotes: 2
Views: 1402
Reputation: 473763
The easiest approach would be to strip the comment tags manually using replace():
data = response.body_as_unicode()
data = data.replace('<!--WPJM-->', '').replace('<!--WPJM_END-->', '')
jsonresponse = json.loads(data)
Though it is neither very Pythonic nor reliable, since it breaks as soon as the marker text changes.
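If hard-coding the marker names feels too brittle, one possible variation is to slice out everything between the first `{` and the last `}` and decode only that. This is a minimal sketch, not something from the question: the helper name `extract_json_payload` is hypothetical, and it assumes the response holds a single JSON object and that the wrapper around it contains no braces.

```python
import json


def extract_json_payload(raw_text):
    """Return the decoded JSON object embedded in raw_text.

    Assumption: the payload is one JSON object, and whatever wraps it
    (for example HTML comments) contains no '{' or '}' characters.
    """
    start = raw_text.find('{')
    end = raw_text.rfind('}')
    if start == -1 or end < start:
        raise ValueError("No JSON object found in response text")
    return json.loads(raw_text[start:end + 1])


# Sample body shaped like the one in the question (the HTML is made up).
raw = ('<!--WPJM-->{"found_jobs":true,"html":"<ul><li>job</li></ul>",'
       '"max_num_pages":3}<!--WPJM_END-->')
data = extract_json_payload(raw)
print(data["max_num_pages"])
```

Inside the spider, `raw` would be the response text instead of the sample string.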
Or, a better option would be to get the text() by XPath:
$ scrapy shell index.html
>>> response.xpath('//text()').extract()[0]
u'{"found_jobs":true,"html":"<html code"}'
Upvotes: 1