Reputation: 159
I am using Scrapy to scrape a video site, and I am having a little difficulty scraping some things.
For example, the page contains:
<embed width="588" height="476" flashvars="id_video=7845976&theskin=default&url_bigthumb=http://sample.com/image.jpg&key=4219e347d8fdc0be3103eb3cbb458258-1416371743&categories=cat1" allowscriptaccess="always" allowfullscreen="true" quality="high" src="http://static.sample.com/swf/xv-player.swf" wmode="transparent" id="flash-player-embed" type="application/x-shockwave-flash">
I am currently able to scrape the attributes of HTML tags using the statement below:
item['thumb'] = hxs.select("//embed[@id='flash-player-embed']/@flashvars").extract()[0]
The above statement gives the result below:
id_video=7845976&theskin=default&url_bigthumb=http://sample.com/image.jpg&key=4219e347d8fdc0be3103eb3cbb458258-1416371743&categories=cat1" allowscriptaccess="always" allowfullscreen="true" quality="high" src="http://static.sample.com/swf/xv-player.swf
I want an hxs.select statement that extracts only the image URL (http://sample.com/image.jpg) from the embed code above.
I have tried:
item['thumb'] = hxs.select("//embed[@id='flash-player-embed']/@flashvars/@url_bigthumb").extract()[0]
but it does not work.
Any help from the Scrapy or Python community is much appreciated, as it will save my precious megabits.
Thanks in advance.
Upvotes: 2
Views: 600
Reputation: 391
My suggestion is to use the split function to get the exact result.
For example:
hxs.select("//embed[@id='flash-player-embed']/@flashvars").extract()[0].split('url_bigthumb=')[1].split('&')[0].strip()
This is the simplest approach for now, but you may want to wait for better answers.
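The same split idea can be wrapped in a small helper that works on the attribute string alone, which makes it easier to test outside a spider. This is only a sketch: the function name extract_flashvar is made up for illustration, and the sample string is the flashvars value from the question.

```python
def extract_flashvar(flashvars, name):
    """Return the value stored under `name` in a flashvars string, or None."""
    # Split into key=value pairs first, so only an exact key match counts.
    for pair in flashvars.split('&'):
        key, sep, value = pair.partition('=')
        if sep and key == name:
            return value.strip()
    return None


# Sample flashvars value taken from the question.
flashvars = ('id_video=7845976&theskin=default'
             '&url_bigthumb=http://sample.com/image.jpg'
             '&key=4219e347d8fdc0be3103eb3cbb458258-1416371743'
             '&categories=cat1')

print(extract_flashvar(flashvars, 'url_bigthumb'))
```

Returning None for a missing key avoids the IndexError that a chained split would raise if a page has no url_bigthumb entry.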
Thanks
Upvotes: 1
Reputation: 12814
The urlparse module (Python 2) also provides a nice solution for extracting the parameters:
>>> from urlparse import parse_qs, urlparse
>>> url = '?' + 'id_video=7845976&theskin=default&url_bigthumb=http://sample.com/image.jpg&key=4219e347d8fdc0be3103eb3cbb458258-1416371743&categories=cat1" allowscriptaccess="always" allowfullscreen="true" quality="high" src="http://static.sample.com/swf/xv-player.swf'
>>> print parse_qs(urlparse(url).query)['url_bigthumb']
['http://sample.com/image.jpg']
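On Python 3 the same functions live in urllib.parse. A minimal sketch, assuming the flashvars value has already been extracted with the XPath from the question and that the thumbnail URL itself contains no unescaped "&":

```python
from urllib.parse import parse_qs

# Sample flashvars value taken from the question.
flashvars = ('id_video=7845976&theskin=default'
             '&url_bigthumb=http://sample.com/image.jpg'
             '&key=4219e347d8fdc0be3103eb3cbb458258-1416371743'
             '&categories=cat1')

# flashvars is already in query-string form, so it can be parsed directly.
params = parse_qs(flashvars)

# parse_qs maps every key to a list of values; take the first one.
thumb = params['url_bigthumb'][0]
print(thumb)
```

Because the attribute is already a bare query string, there is no need to prepend '?' and go through urlparse first.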
Upvotes: 2
Reputation: 23796
Use a regular expression after the XPath selection with the .re() method:
>>> from scrapy.selector import Selector
>>> sel = Selector(text="""<embed width="588" height="476" flashvars="id_video=7845976&theskin=default&url_bigthumb=http://sample.com/image.jpg&key=4219e347d8fdc0be3103eb3cbb458258-1416371743&categories=cat1" allowscriptaccess="always" allowfullscreen="true" quality="high" src="http://static.sample.com/swf/xv-player.swf" wmode="transparent" id="flash-player-embed" type="application/x-shockwave-flash">""")
>>> sel.xpath("//embed/@flashvars").re('url_bigthumb=([^&]+)')
[u'http://sample.com/image.jpg']
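If the attribute string is already in hand, for example in item['thumb'], roughly the same pattern can be applied with the standard re module. This is a sketch rather than a Scrapy API: the helper name thumb_from_flashvars is hypothetical, and the pattern is anchored slightly more strictly than the one above so that it only matches a parameter named exactly url_bigthumb.

```python
import re

# Match url_bigthumb only at the start of the string or right after "&".
FLASHVAR_RE = re.compile(r'(?:^|&)url_bigthumb=([^&]+)')


def thumb_from_flashvars(flashvars):
    """Return the url_bigthumb value, or None if the parameter is missing."""
    match = FLASHVAR_RE.search(flashvars)
    return match.group(1) if match else None


# Sample flashvars value taken from the question.
flashvars = ('id_video=7845976&theskin=default'
             '&url_bigthumb=http://sample.com/image.jpg'
             '&key=4219e347d8fdc0be3103eb3cbb458258-1416371743'
             '&categories=cat1')

print(thumb_from_flashvars(flashvars))
```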
Upvotes: 0
Reputation: 5814
A quick solution using a regex (the pattern uses https? so that both http and https URLs match) would be:
import re
re.findall(r'https?://[^\s<>&"]+|www\.[^\s<>&"]+', item['thumb'])[0]
Upvotes: 0