r_al_sim

Reputation: 15

Scrapy: XPath not returning the full URL for @href

When scraping with XPath in Scrapy, I don't get the full URL.

Here is the URL I am looking at, opened in the Scrapy shell:

scrapy shell "http://www.ybracing.com/omp-ia01854-omp-first-evo-race-suit.html"

I run the following XPath query from the shell:

sel.xpath("//*[@id='Thumbnail-Image-Container']/li[1]/a//@href")

and get only half of the href:

[<Selector xpath="//*[@id='Thumbnail-Image-Container']/li[1]/a//@href" data=u'http://images.esellerpro.com/2489/I/160/'>]

Here's the snippet of HTML I am looking at in a browser:

        <li><a data-medimg="http://images.esellerpro.com/2489/I/160/260/1/medIA01854-GALLERY.jpg" href="http://images.esellerpro.com/2489/I/160/260/1/lrgIA01854-GALLERY.jpg" class="cloud-zoom-gallery Selected" title="OMP FIRST EVO RACE SUIT" rel="useZoom: 'MainIMGLink', smallImage: 'http://images.esellerpro.com/2489/I/160/260/1/lrgIA01854-GALLERY.jpg'"><img src="http://images.esellerpro.com/2489/I/160/260/1/smIA01854-GALLERY.jpg" alt="OMP FIRST EVO RACE SUIT Thumbnail 1"></a></li>            

And here it is from wget:

<li><a data-medimg="http://images.esellerpro.com/2489/I/513/0/medIA01838_GALLERY.JPG" href="http://images.esellerpro.com/2489/I/513/0/lrgIA01838_GALLERY.JPG" class="cloud-zoom-gallery Selected" title="OMP DYNAMO RACE SUIT" rel="useZoom: 'MainIMGLink', smallImage: 'http://images.esellerpro.com/2489/I/513/0/lrgIA01838_GALLERY.JPG'"><img src="http://images.esellerpro.com/2489/I/513/0/smIA01838_GALLERY.JPG" alt="OMP DYNAMO RACE SUIT Thumbnail 1" /></a></li>            

I have tried varying my XPath, but I still get the same result.

What is causing this, and what can I do to work around it? I would like to understand the cause rather than have someone just correct my XPath for me.

Some thoughts on the page itself: I disabled JavaScript to see whether JS was generating half the URL, but it is not. I also downloaded the page with wget to confirm the URLs are complete in the original HTML.

I haven't tested any other builds, but I'm using Scrapy 1.2.1 with Python 2.7 on CentOS 7.

I've googled and only found people who can't grab the data because JavaScript generates it on the fly, but my data is there in the HTML.

Upvotes: 1

Views: 865

Answers (1)

starrify

Reputation: 14731

By using

sel.xpath("//*[@id='Thumbnail-Image-Container']/li[1]/a//@href")

you get a list of Selector instances. The data field in each one's printed representation shows only the beginning of its content (since the content might be very long); the selected value itself is not truncated.
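To illustrate the difference between the printed preview and the stored value, here is a minimal sketch that uses only the standard library. TruncatingSelector is a hypothetical stand-in, not Scrapy's real Selector class, and the 40-character cut-off is an assumption that happens to match the length of the output shown in the question.

```python
class TruncatingSelector(object):
    """Hypothetical stand-in for a Scrapy Selector; not the real class."""

    def __init__(self, value):
        self._value = value

    def extract(self):
        # Return the full, untruncated string.
        return self._value

    def __repr__(self):
        # Assumption: the preview shows only the first 40 characters.
        return "<Selector data=%r>" % self._value[:40]


href = "http://images.esellerpro.com/2489/I/160/260/1/lrgIA01854-GALLERY.jpg"
sel = TruncatingSelector(href)

preview = repr(sel)   # what the shell echoes for the object
full = sel.extract()  # the complete value held by the object

print(preview)
print(full)
```

The preview is shortened only for display, so nothing needs to change in the XPath expression itself.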

To retrieve the content as a string (instead of a Selector instance), you need to call a method such as .extract() or .extract_first():

>>> print(sel.xpath("//*[@id='Thumbnail-Image-Container']/li[1]/a//@href").extract_first())
http://images.esellerpro.com/2489/I/160/260/1/lrgIA01854-GALLERY.jpg

Upvotes: 3
