Ashutosh Saboo

Reputation: 364

Scrapy Scraper Issue

I am trying to use Scrapy to scrape www.paytm.com. The website uses AJAX requests, in the form of XHR, to display search results.

I managed to track down the XHR, and its response is similar to JSON, but it isn't actually valid JSON.

This is the link for one of the XHR requests: https://search.paytm.com/search/?page_count=2&userQuery=tv&items_per_page=30&resolution=960x720&quality=high&q=tv&cat_tree=1&callback=angular.callbacks._6 . If you look at the URL, the parameter page_count controls which page of results is shown, and the parameter userQuery carries the search query passed to the website.

Now, if you look at the response closely, it isn't actually JSON; it only looks similar to JSON (I verified this on http://jsonlint.com/). I want to scrape this using Scrapy, simply because it is a framework: it would be faster than using other libraries like BeautifulSoup, since building a scraper that runs at such a high speed with those would take a lot of effort. That is the only reason I want to use Scrapy.

This is the snippet of code I used to try to parse the JSON response from the URL:

    jsonresponse = json.loads(response.body_as_unicode())
    print json.dumps(jsonresponse, indent=4, sort_keys=True)

On execution, it throws an error:

2015-07-05 12:13:23 [scrapy] INFO: Scrapy 1.0.0 started (bot: scrapybot)
2015-07-05 12:13:23 [scrapy] INFO: Optional features available: ssl, http11
2015-07-05 12:13:23 [scrapy] INFO: Overridden settings: {'DEPTH_PRIORITY': 1, 'SCHEDULER_MEMORY_QUEUE': 'scrapy.squeues.FifoMemoryQueue', 'SCHEDULER_DISK_QUEUE': 'scrapy.squeues.PickleFifoDiskQueue', 'CONCURRENT_REQUESTS': 100}
2015-07-05 12:13:23 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2015-07-05 12:13:23 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-07-05 12:13:23 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-07-05 12:13:23 [scrapy] INFO: Enabled item pipelines: 
2015-07-05 12:13:23 [scrapy] INFO: Spider opened
2015-07-05 12:13:23 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-07-05 12:13:23 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-07-05 12:13:24 [scrapy] DEBUG: Crawled (200) <GET https://search.paytm.com/search/?page_count=2&userQuery=tv&items_per_page=30&resolution=960x720&quality=high&q=tv&cat_tree=1&callback=angular.callbacks._6> (referer: None)
2015-07-05 12:13:24 [scrapy] ERROR: Spider error processing <GET https://search.paytm.com/search/?page_count=2&userQuery=tv&items_per_page=30&resolution=960x720&quality=high&q=tv&cat_tree=1&callback=angular.callbacks._6> (referer: None)
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 577, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "Startup App/SCRAPERS/paytmscraper_scrapy/paytmspiderscript.py", line 111, in parse
    jsonresponse = json.loads(response.body_as_unicode())
  File "/usr/lib/python2.7/json/__init__.py", line 338, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python2.7/json/decoder.py", line 366, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python2.7/json/decoder.py", line 384, in raw_decode
    raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded
2015-07-05 12:13:24 [scrapy] INFO: Closing spider (finished)
2015-07-05 12:13:24 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 343,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 6483,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2015, 7, 5, 6, 43, 24, 733187),
 'log_count/DEBUG': 2,
 'log_count/ERROR': 1,
 'log_count/INFO': 7,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'spider_exceptions/ValueError': 1,
 'start_time': datetime.datetime(2015, 7, 5, 6, 43, 23, 908135)}
2015-07-05 12:13:24 [scrapy] INFO: Spider closed (finished)

Now, my question: how do I scrape such a response using Scrapy? If any other code is required, feel free to ask in the comments, and I shall willingly provide it.

Please provide the complete code related to this; it would be much appreciated. Some manipulation of the response in Python (similar to string comparison) would also work for me, if it helps me scrape this.

P.S.: I can't modify the response by hand every time, because this is what the website returns. So please suggest a programmatic (Pythonic) way to do this. Preferably, I want to use Scrapy as my framework.

Upvotes: 0

Views: 967

Answers (3)

wordpressandphpdev

Reputation: 21

Paytm provides JSON data; see:

https://catalog.paytm.com/v1//g/electronics/mobile-accessories/mobiles

Catalog pages return JSON data containing the product name, product URL, offer price, actual price, image data, etc.

How to get data for a category:

In the above URL you can see catalog.paytm.com/v1//g/ , which is common to all such URLs; you replace the rest of the URL following this format:

menu item > category > subcategory.

Here electronics is the menu item, mobile-accessories is the category, and mobiles is a subcategory of mobile accessories.

When you request a URL in that format, Paytm returns the JSON data. You can query for more pages with the following parameters:

page_count and items_per_count

Example: catalog.paytm.com/v1//g/electronics/mobile-accessories/mobiles?page_count=2&items_per_count=30

In the JSON data, look for grid_layout. If it is absent, the page has no items and you can break out of your loop; otherwise, process the JSON data and read the product details.
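The paging loop described above could be sketched like this. The catalog_url helper and its query parameters are assumptions based on this answer's URL format, and the endpoint may well have changed since:

```python
import json

def has_items(payload):
    """Return True if a parsed catalog page contains a grid_layout
    (per this answer, pages without one have no items)."""
    return "grid_layout" in payload

def catalog_url(menu_item, category, subcategory, page_count, items_per_count=30):
    """Build a catalog URL in the menu item > category > subcategory format.
    Hypothetical helper; the host and path come from this answer."""
    return ("https://catalog.paytm.com/v1//g/{}/{}/{}?page_count={}&items_per_count={}"
            .format(menu_item, category, subcategory, page_count, items_per_count))

# Fabricated sample payloads standing in for real responses:
sample_page = json.loads('{"grid_layout": [{"name": "Some Phone"}]}')
empty_page = json.loads('{"message": "no results"}')

print(has_items(sample_page))  # True  -> keep paging
print(has_items(empty_page))   # False -> break out of the loop
print(catalog_url("electronics", "mobile-accessories", "mobiles", 2))
```

In a real spider, each page_count value would be requested in turn until has_items returns False.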

Upvotes: 2

eozzy

Reputation: 68650

Change:

https://search.paytm.com/search/?page_count=2&userQuery=tv&items_per_page=30&resolution=960x720&quality=high&q=tv&cat_tree=1&callback=angular.callbacks._6

to:

https://search.paytm.com/search/?page_count=2&userQuery=tv&items_per_page=30&resolution=960x720&quality=high&q=tv&cat_tree=1

.. and you have JSON.
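Dropping the callback parameter programmatically, so every search URL can be rebuilt before requesting it, might look like this (a sketch using only the standard library; Python 3 shown, where these helpers live in urllib.parse):

```python
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

def strip_callback(url):
    """Rebuild a URL without its JSONP 'callback' query parameter."""
    parts = urlparse(url)
    query = [(k, v) for k, v in parse_qsl(parts.query) if k != "callback"]
    return urlunparse(parts._replace(query=urlencode(query)))

url = ("https://search.paytm.com/search/?page_count=2&userQuery=tv"
       "&items_per_page=30&resolution=960x720&quality=high&q=tv"
       "&cat_tree=1&callback=angular.callbacks._6")
print(strip_callback(url))
# https://search.paytm.com/search/?page_count=2&userQuery=tv&items_per_page=30&resolution=960x720&quality=high&q=tv&cat_tree=1
```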

Upvotes: 0

GHajba

Reputation: 3691

If you look at the not-quite-JSON result, it is clear that it contains JSON.

If you remove the initial typeof angular.callbacks._6 === "function" && angular.callbacks._6( part and the trailing ); from the response, you get valid JSON, which you can verify with JSONLint.

So the solution is to find the first occurrence of { and the last occurrence of } in the response, extract the text between them (inclusive of those curly brackets), and pass that to json.loads instead of the whole body.
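A minimal sketch of that extraction, using a fabricated response body in the same shape as Paytm's reply. In the spider, the same function would be called with response.body_as_unicode():

```python
import json

def jsonp_to_dict(body):
    """Strip a JSONP wrapper by keeping everything between the first '{'
    and the last '}' (inclusive), then parse it as JSON."""
    start = body.index("{")
    end = body.rindex("}")
    return json.loads(body[start:end + 1])

# Fabricated stand-in for the real AJAX response:
body = ('typeof angular.callbacks._6 === "function" && '
        'angular.callbacks._6({"items": [1, 2]});')
print(jsonp_to_dict(body))  # {'items': [1, 2]}
```

This assumes the payload is a single JSON object, which holds for this response; index/rindex raise ValueError if no braces are present, which is a reasonable failure mode for an unexpected body.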

Upvotes: 3
