Reputation: 15
When performing a scrape using XPath with Scrapy, I don't get the full URL.
Here is the URL I am looking at, loaded in the Scrapy shell:
scrapy shell "http://www.ybracing.com/omp-ia01854-omp-first-evo-race-suit.html"
I run the following XPath select from the shell:
sel.xpath("//*[@id='Thumbnail-Image-Container']/li[1]/a//@href")
and get only half the href:
[<Selector xpath="//*[@id='Thumbnail-Image-Container']/li[1]/a//@href" data=u'http://images.esellerpro.com/2489/I/160/'>]
Here's the snippet of HTML I am looking at in a browser:
<li><a data-medimg="http://images.esellerpro.com/2489/I/160/260/1/medIA01854-GALLERY.jpg" href="http://images.esellerpro.com/2489/I/160/260/1/lrgIA01854-GALLERY.jpg" class="cloud-zoom-gallery Selected" title="OMP FIRST EVO RACE SUIT" rel="useZoom: 'MainIMGLink', smallImage: 'http://images.esellerpro.com/2489/I/160/260/1/lrgIA01854-GALLERY.jpg'"><img src="http://images.esellerpro.com/2489/I/160/260/1/smIA01854-GALLERY.jpg" alt="OMP FIRST EVO RACE SUIT Thumbnail 1"></a></li>
And here it is from wget:
<li><a data-medimg="http://images.esellerpro.com/2489/I/513/0/medIA01838_GALLERY.JPG" href="http://images.esellerpro.com/2489/I/513/0/lrgIA01838_GALLERY.JPG" class="cloud-zoom-gallery Selected" title="OMP DYNAMO RACE SUIT" rel="useZoom: 'MainIMGLink', smallImage: 'http://images.esellerpro.com/2489/I/513/0/lrgIA01838_GALLERY.JPG'"><img src="http://images.esellerpro.com/2489/I/513/0/smIA01838_GALLERY.JPG" alt="OMP DYNAMO RACE SUIT Thumbnail 1" /></a></li>
I have tried varying my XPath to pull the same attribute, but I still get the same result.
What is causing this, and what can I do to work around it? I would like to understand rather than have someone just correct my XPath for me.
Some thoughts on the page itself: I disabled JavaScript to see if the JS was generating half the URL, but it's not. I also downloaded the page with wget to confirm the URLs are complete in the original HTML.
I haven't tested any other builds, but I'm using Scrapy 1.2.1 with Python 2.7 on CentOS 7.
I've googled and only found people who can't grab the data because JavaScript generates it on the fly, but my data is there in the HTML.
Upvotes: 1
Views: 865
Reputation: 14731
By using
sel.xpath("//*[@id='Thumbnail-Image-Container']/li[1]/a//@href")
you get a list of Selector instances, whose data field shows only a truncated preview of the content (the first 40 characters in your output), since the full value might be very long.
To retrieve the content as a string (instead of a Selector instance), you need to use something like .extract() or .extract_first():
>>> print(sel.xpath("//*[@id='Thumbnail-Image-Container']/li[1]/a//@href").extract_first())
http://images.esellerpro.com/2489/I/160/260/1/lrgIA01854-GALLERY.jpg
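To make the difference between the preview and the real value concrete, here is a minimal standard-library sketch. It is not Scrapy's actual code: PreviewSelector is a hypothetical stand-in that only imitates a truncated repr, the 40-character width is an assumption taken from the output in the question, and the ul wrapper in the markup is assumed as well (the question only shows the li element).

```python
import xml.etree.ElementTree as ET

# Assumption: 40 matches the number of characters visible in the question's output.
PREVIEW_WIDTH = 40


class PreviewSelector(object):
    """Hypothetical stand-in for a Selector; it only imitates the truncated repr."""

    def __init__(self, xpath, value):
        self.xpath = xpath
        self.value = value

    def __repr__(self):
        # Only a slice of the value is shown here; the stored value is untouched.
        return "<Selector xpath=%r data=%r>" % (self.xpath, self.value[:PREVIEW_WIDTH])

    def extract(self):
        # Returns the complete string, not the preview.
        return self.value


# Assumed markup: the <ul> container is a guess, the <a> href is from the question.
MARKUP = (
    '<div><ul id="Thumbnail-Image-Container">'
    '<li><a href="http://images.esellerpro.com/2489/I/160/260/1/lrgIA01854-GALLERY.jpg">'
    '<img src="http://images.esellerpro.com/2489/I/160/260/1/smIA01854-GALLERY.jpg" />'
    '</a></li></ul></div>'
)

root = ET.fromstring(MARKUP)
link = root.find(".//*[@id='Thumbnail-Image-Container']/li[1]/a")
selector = PreviewSelector("//*[@id='Thumbnail-Image-Container']/li[1]/a//@href",
                           link.get("href"))

print(repr(selector))      # shortened preview, like the shell output in the question
print(selector.extract())  # the complete URL
```

The point of the sketch is that nothing is lost during parsing: the shortened text exists only in the printed representation, so calling the extraction method is enough and the XPath itself does not need to change.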
Upvotes: 3