Reputation: 4122
I am still trying to get my head around json.loads and json.dumps to extract what I want from a web page. I am after some data from this link, which takes this format:
data: {
    url: 'stage-player-stat'
},
defaultParams: {
    stageId: 9155,
    teamId: 32,
    playerId: -1,
    field: 2
},
The code I am using is this:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy.item import Item
from scrapy.spider import BaseSpider
from scrapy import log
from scrapy.cmdline import execute
from scrapy.utils.markup import remove_tags
import time
import re
import json
import requests


class ExampleSpider(CrawlSpider):
    name = "goal2"
    allowed_domains = ["whoscored.com"]
    start_urls = ["http://www.whoscored.com/Teams/32/"]
    rules = [Rule(SgmlLinkExtractor(allow=('\Teams'), deny=(),), follow=False, callback='parse_item')]

    def parse_item(self, response):
        stagematch = re.compile("data:\s*{\s*url:\s*'stage-player-stat'\s*},\s*defaultParams:\s*{\s*(.*?),.*},", re.S)
        stagematch2 = re.search(stagematch, response.body)
        if stagematch2 is not None:
            stagematch3 = stagematch2.group(1)
            stageid = json.dumps(stagematch3)
            print "stageid = ", stageid

execute(['scrapy','crawl','goal2'])
In this example, stageid resolves to "stageId: 9155". What I want it to resolve to, though, is just 9155. I have tried to parse it with stageid = stageid[0], as if it were a dictionary, but this is not working. What am I doing wrong?
Thanks
Upvotes: 0
Views: 429
Reputation: 20748
A solution using js2xml: parse each <script> block's contents, look for the var defaultTeamPlayerStatsConfigParams declaration and grab its init object, then call js2xml.jsonlike.make_dict() on it to get a Python dict. Here's how it goes, illustrated in this scrapy shell session:
$ scrapy shell http://www.whoscored.com/Teams/32/
2014-09-08 11:17:31+0200 [scrapy] INFO: Scrapy 0.24.4 started (bot: scrapybot)
...
2014-09-08 11:17:32+0200 [default] DEBUG: Crawled (200) <GET http://www.whoscored.com/Teams/32/> (referer: None)
[s] Available Scrapy objects:
[s] crawler <scrapy.crawler.Crawler object at 0x7f88f0605990>
[s] item {}
[s] request <GET http://www.whoscored.com/Teams/32/>
[s] response <200 http://www.whoscored.com/Teams/32/>
[s] settings <scrapy.settings.Settings object at 0x7f88f6046450>
[s] spider <Spider 'default' at 0x7f88efdaff50>
[s] Useful shortcuts:
[s] shelp() Shell help (print this help)
[s] fetch(req_or_url) Fetch request (or URL) and update local objects
[s] view(response) View response in a browser
In [1]: import pprint
In [2]: import js2xml
In [3]: for script in response.xpath('//script/text()').extract():
   ...:     jstree = js2xml.parse(script)
   ...:     params = jstree.xpath('//var[@name="defaultTeamPlayerStatsConfigParams"]/object')
   ...:     if params:
   ...:         pprint.pprint(js2xml.jsonlike.make_dict(params[0]))
   ...:
{'data': {'url': 'stage-player-stat'},
'defaultParams': {'field': 2, 'playerId': -1, 'stageId': 9155, 'teamId': 32},
'fitText': {'container': '.grid .team-link, .grid .player-link',
'options': {'width': 150}},
'fixZeros': True}
In [4]: for script in response.xpath('//script/text()').extract():
   ...:     jstree = js2xml.parse(script)
   ...:     params = jstree.xpath('//var[@name="defaultTeamPlayerStatsConfigParams"]/object')
   ...:     if params:
   ...:         params = js2xml.jsonlike.make_dict(params[0])
   ...:         print params["defaultParams"]["stageId"]
   ...:
9155
In [5]:
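If you want this in the spider itself rather than in the shell, the same logic can live in your parse_item callback. A minimal sketch, reusing the class and imports from the question (untested beyond the shell session above):

import js2xml

class ExampleSpider(CrawlSpider):
    # name, allowed_domains, start_urls and rules as in the question

    def parse_item(self, response):
        # Walk every <script> block, pick out the one declaring
        # defaultTeamPlayerStatsConfigParams, and turn its init object into a dict.
        for script in response.xpath('//script/text()').extract():
            jstree = js2xml.parse(script)
            params = jstree.xpath('//var[@name="defaultTeamPlayerStatsConfigParams"]/object')
            if params:
                config = js2xml.jsonlike.make_dict(params[0])
                stageid = config["defaultParams"]["stageId"]  # already an int: 9155
                print "stageid = ", stageid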
Upvotes: 2
Reputation: 87201
In your parse_item, you can pull the number straight out of the matched group instead of calling json.dumps on it:
stagematch3 = stagematch2.group(1)
stageid = int(stagematch3.split(':', 1)[1])
Then you may convert it back to str if you wish:
stageid = str(stageid)
There are many other ways to solve your problem. One of them is to use a simpler regexp and then parse the matched text with json.loads.
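For example, here is a rough sketch of that json.loads route. The pattern is hypothetical and assumes the defaultParams block looks like the snippet in the question; note that the bare JS keys have to be quoted before json.loads will accept them:

import json
import re

# response.body is the raw HTML, as in your parse_item
match = re.search(r"defaultParams:\s*({.*?})", response.body, re.S)
if match is not None:
    raw = match.group(1)                           # "{ stageId: 9155, teamId: 32, ... }"
    as_json = re.sub(r"(\w+)\s*:", r'"\1":', raw)  # quote the bare keys to make it valid JSON
    params = json.loads(as_json)
    print params["stageId"]                        # -> 9155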
Upvotes: 1