vrtech77

Reputation: 159

Scrape elements inside tag properties - Scrapy

I am using Scrapy to scrape a video site, and I am having difficulty extracting one particular value.

For example, given this tag:

<embed width="588" height="476" flashvars="id_video=7845976&amp;theskin=default&amp;url_bigthumb=http://sample.com/image.jpg&amp;key=4219e347d8fdc0be3103eb3cbb458258-1416371743&amp;categories=cat1" allowscriptaccess="always" allowfullscreen="true" quality="high" src="http://static.sample.com/swf/xv-player.swf" wmode="transparent" id="flash-player-embed" type="application/x-shockwave-flash">

I can currently extract an attribute of an HTML tag with the statement below:

item['thumb'] = hxs.select("//embed[@id='flash-player-embed']/@flashvars").extract()[0]

That statement gives the following result:

id_video=7845976&theskin=default&url_bigthumb=http://sample.com/image.jpg&key=4219e347d8fdc0be3103eb3cbb458258-1416371743&categories=cat1" allowscriptaccess="always" allowfullscreen="true" quality="high" src="http://static.sample.com/swf/xv-player.swf

I want an hxs.select statement that extracts only the image URL from the embed code above, like this:

http://sample.com/image.jpg

I have tried:

item['thumb'] = hxs.select("//embed[@id='flash-player-embed']/@flashvars/@url_bigthumb").extract()[0]

but it does not work.

Any help from the Scrapy or Python community is much appreciated, as it will save my precious megabits.

Thanks in advance.

Upvotes: 2

Views: 600

Answers (4)

Anandhakumar R

Reputation: 391

I suggest using the split function to get exactly the value you want.

For example,

hxs.select("//embed[@id='flash-player-embed']/@flashvars").extract()[0].split('url_bigthumb=')[1].split('&')[0]

This is the simplest approach for now, but better answers may follow.
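As a rough illustration of the split idea, here is a minimal sketch that works on the already-extracted flashvars string, so it runs without Scrapy. The helper name `value_after` is hypothetical, and the string is the decoded attribute value from the question.

```python
def value_after(flashvars, key):
    """Return the value that follows 'key=' in an '&'-separated string, or None."""
    marker = key + "="
    for pair in flashvars.split("&"):
        if pair.startswith(marker):
            return pair[len(marker):]
    return None


# Decoded flashvars value, as the XPath extraction would return it.
flashvars = (
    "id_video=7845976&theskin=default"
    "&url_bigthumb=http://sample.com/image.jpg"
    "&key=4219e347d8fdc0be3103eb3cbb458258-1416371743&categories=cat1"
)

thumb = value_after(flashvars, "url_bigthumb")
print(thumb)
```

Splitting on '&' first avoids depending on which parameter happens to follow url_bigthumb.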

Thanks

Upvotes: 1

Nima Soroush

Reputation: 12814

urlparse also provides a nice solution for getting individual parameters:

>>> from urlparse import parse_qs, urlparse
>>> url = '?' + 'id_video=7845976&theskin=default&url_bigthumb=http://sample.com/image.jpg&key=4219e347d8fdc0be3103eb3cbb458258-1416371743&categories=cat1" allowscriptaccess="always" allowfullscreen="true" quality="high" src="http://static.sample.com/swf/xv-player.swf'

>>> print parse_qs(urlparse(url).query)['url_bigthumb']
['http://sample.com/image.jpg']
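
On Python 3 the urlparse module lives in urllib.parse. A minimal sketch of the same idea, assuming the flashvars value has already been extracted and decoded (the string below is taken from the question), could look like this:

```python
from urllib.parse import parse_qs

# Decoded flashvars value; it is already shaped like a query string.
flashvars = (
    "id_video=7845976&theskin=default"
    "&url_bigthumb=http://sample.com/image.jpg"
    "&key=4219e347d8fdc0be3103eb3cbb458258-1416371743&categories=cat1"
)

# parse_qs maps each parameter name to a list of its values.
params = parse_qs(flashvars)
thumb = params["url_bigthumb"][0]
print(thumb)
```

Because parse_qs returns a list per key, the [0] picks the first (here, only) value.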

Upvotes: 2

Elias Dorneles

Reputation: 23796

Use a regular expression after the XPath selection with the .re() method:

>>> from scrapy.selector import Selector
>>> sel = Selector(text="""<embed width="588" height="476" flashvars="id_video=7845976&amp;theskin=default&amp;url_bigthumb=http://sample.com/image.jpg&amp;key=4219e347d8fdc0be3103eb3cbb458258-1416371743&amp;categories=cat1" allowscriptaccess="always" allowfullscreen="true" quality="high" src="http://static.sample.com/swf/xv-player.swf" wmode="transparent" id="flash-player-embed" type="application/x-shockwave-flash">""")
>>> sel.xpath("//embed/@flashvars").re('url_bigthumb=([^&]+)')
[u'http://sample.com/image.jpg']
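
The same pattern can be tried without Scrapy installed. The sketch below is not the Scrapy API: it swaps in the standard library's HTMLParser for the XPath step and re.search for .re(). The class name `FlashvarsFinder` is hypothetical, and the tag is a shortened copy of the one in the question.

```python
import re
from html.parser import HTMLParser


class FlashvarsFinder(HTMLParser):
    """Remember the flashvars attribute of the first <embed> tag seen."""

    def __init__(self):
        super().__init__()
        self.flashvars = None

    def handle_starttag(self, tag, attrs):
        # Attribute values arrive with entities such as &amp; already decoded.
        if tag == "embed" and self.flashvars is None:
            self.flashvars = dict(attrs).get("flashvars")


html_snippet = (
    '<embed flashvars="id_video=7845976&amp;theskin=default'
    '&amp;url_bigthumb=http://sample.com/image.jpg'
    '&amp;key=4219e347d8fdc0be3103eb3cbb458258-1416371743&amp;categories=cat1" '
    'id="flash-player-embed" type="application/x-shockwave-flash">'
)

finder = FlashvarsFinder()
finder.feed(html_snippet)

# Capture everything after "url_bigthumb=" up to the next "&".
match = re.search(r"url_bigthumb=([^&]+)", finder.flashvars or "")
thumb = match.group(1) if match else None
print(thumb)
```

The capture group stops at the next '&', so the result does not depend on the order of the other parameters.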


Upvotes: 0

aberna

Reputation: 5814

A quick solution using regex would be:

import re
re.findall(r'https?://[^\s<>&"]+|www\.[^\s<>&"]+', item['thumb'])[0]

Upvotes: 0
