gdogg371

Reputation: 4122

Get desired value from this with json.dumps()

I am still trying to get my head around json.loads and json.dumps so that I can extract what I want from a web page. I am after some data from this link, which has the following format:

data:{
                url: 'stage-player-stat'
            },
            defaultParams: {
                stageId: 9155,
                teamId: 32,
                playerId: -1,
                field: 2
            },

The code I am using is this:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy.item import Item
from scrapy.spider import BaseSpider
from scrapy import log
from scrapy.cmdline import execute
from scrapy.utils.markup import remove_tags
import time
import re
import json
import requests

class ExampleSpider(CrawlSpider):
    name = "goal2"
    allowed_domains = ["whoscored.com"]
    start_urls = ["http://www.whoscored.com/Teams/32/"]

    rules = [Rule(SgmlLinkExtractor(allow=('\Teams'),deny=(),), follow=False, callback='parse_item')]

    def parse_item(self, response):

        stagematch = re.compile("data:\s*{\s*url:\s*'stage-player-stat'\s*},\s*defaultParams:\s*{\s*(.*?),.*},",re.S)

        stagematch2 = re.search(stagematch, response.body)

        if stagematch2 is not None:
            stagematch3 = stagematch2.group(1)


            stageid = json.dumps(stagematch3)

            print "stageid = ", stageid

execute(['scrapy','crawl','goal2'])

In this example, stageid resolves to "stageId: 9155". What I want it to resolve to, though, is just 9155. I have tried indexing the result with stageid = stageid[0], as if it were a dictionary, but this does not work. What am I doing wrong?

Thanks

Upvotes: 0

Views: 429

Answers (2)

paul trmbrth

Reputation: 20748

A solution using js2xml:

  • get all <script> contents
  • parse each with js2xml: you get a lxml tree back
  • using XPath on the lxml document, look for a var defaultTeamPlayerStatsConfigParams and get its init object
  • use js2xml.jsonlike.make_dict() to get a Python dict out of it

Here's how it goes, illustrated in this scrapy shell session:

$ scrapy shell http://www.whoscored.com/Teams/32/
2014-09-08 11:17:31+0200 [scrapy] INFO: Scrapy 0.24.4 started (bot: scrapybot)
...
2014-09-08 11:17:32+0200 [default] DEBUG: Crawled (200) <GET http://www.whoscored.com/Teams/32/> (referer: None)
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x7f88f0605990>
[s]   item       {}
[s]   request    <GET http://www.whoscored.com/Teams/32/>
[s]   response   <200 http://www.whoscored.com/Teams/32/>
[s]   settings   <scrapy.settings.Settings object at 0x7f88f6046450>
[s]   spider     <Spider 'default' at 0x7f88efdaff50>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser

In [1]: import pprint

In [2]: import js2xml

In [3]: for script in response.xpath('//script/text()').extract():
   ...:     jstree = js2xml.parse(script)
   ...:     params = jstree.xpath('//var[@name="defaultTeamPlayerStatsConfigParams"]/object')
   ...:     if params:
   ...:         pprint.pprint(js2xml.jsonlike.make_dict(params[0]))
   ...:         
{'data': {'url': 'stage-player-stat'},
 'defaultParams': {'field': 2, 'playerId': -1, 'stageId': 9155, 'teamId': 32},
 'fitText': {'container': '.grid .team-link, .grid .player-link',
             'options': {'width': 150}},
 'fixZeros': True}

In [4]: for script in response.xpath('//script/text()').extract():
   ...:     jstree = js2xml.parse(script)
   ...:     params = jstree.xpath('//var[@name="defaultTeamPlayerStatsConfigParams"]/object')
   ...:     if params:
   ...:         params = js2xml.jsonlike.make_dict(params[0])
   ...:         print params["defaultParams"]["stageId"]
   ...:         
9155

In [5]: 

Upvotes: 2

pts

Reputation: 87201

stagematch3 = stagematch2.group(1)
stageid = int(stagematch3.split(':', 1)[1])

Then you may convert it back to str if you wish:

stageid = str(stageid)
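
To see why json.dumps did not help, here is a small sketch. It assumes the captured group is exactly the string "stageId: 9155", as reported in the question; the variable name captured is made up for illustration.

```python
import json

# Assumed value of stagematch2.group(1), as reported in the question.
captured = "stageId: 9155"

# json.dumps serialises a Python object to JSON text. Given a str,
# it only wraps the text in JSON quotes; it does not parse anything.
dumped = json.dumps(captured)
print(dumped)

# Splitting on the first colon and converting the right-hand part
# yields the number; int() tolerates the leading space.
stageid = int(captured.split(":", 1)[1])
print(stageid)
```

So json.dumps goes from Python object to text, which is the opposite direction from what is needed here.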

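There are many other ways to solve your problem. One of them is to use a simpler regexp and then parse the matched text with json.loads (the keys would need quoting first, since a JavaScript object literal with bare keys is not valid JSON).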
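
A minimal sketch of the simpler-regexp route, under some assumptions: the page text looks like the snippet in the question, the defaultParams object is flat, and its values are plain numbers. The page string and the helper extract_default_params are hypothetical stand-ins, not part of Scrapy or of the original code.

```python
import json
import re

# Hypothetical stand-in for response.body, copied from the question.
page = """
data:{
                url: 'stage-player-stat'
            },
            defaultParams: {
                stageId: 9155,
                teamId: 32,
                playerId: -1,
                field: 2
            },
"""


def extract_default_params(text):
    """Return the defaultParams object as a dict, or None if not found."""
    # Capture everything from the opening brace to the first closing brace.
    match = re.search(r"defaultParams:\s*(\{.*?\})", text, re.S)
    if match is None:
        return None
    # JavaScript allows bare keys, JSON does not, so quote each key.
    # This naive substitution assumes no string values containing colons.
    json_text = re.sub(r"(\w+)\s*:", r'"\1":', match.group(1))
    return json.loads(json_text)


params = extract_default_params(page)
print(params["stageId"])
```

This is likely to break on nested objects or quoted string values, in which case a real JavaScript parser is the safer choice.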

Upvotes: 1
