Scrapy: XPath not returning the full URL for @href

Question

When I scrape a page using XPath with Scrapy, I don't get the full URL.

Here is the URL I am looking at, opened in the Scrapy shell:

scrapy shell "http://www.ybracing.com/omp-ia01854-omp-first-evo-race-suit.html"

I run the following XPath query from the shell:

sel.xpath("//*[@id='Thumbnail-Image-Container']/li[1]/a//@href")

and get only half of the href:

[]

Here is the snippet of HTML I am looking at in a browser

and here it is from wget

I have tried varying my XPath to pull the same attribute, but I still get the same result.

What is causing this, and what can I do to work around it? I would like to understand the cause rather than have someone just correct my XPath for me.

Some thoughts on the page itself: I disabled JavaScript to see whether it was generating half the URL, but it is not. I also downloaded the page with wget to confirm that the URLs are complete in the original HTML.

I haven't tested any other builds, but I'm using Scrapy 1.2.1 with Python 2.7 on CentOS 7.

I've googled this and only found people who can't grab the data because JavaScript generates it on the fly, but my data is there in the HTML.

starrify · Accepted Answer

By using

sel.xpath("//*[@id='Thumbnail-Image-Container']/li[1]/a//@href")

you get a list of Selector instances. The data field in each one's printed representation shows only the first few characters of its content (since the content might be very long), so the href itself is not cut off; only the preview is.

To retrieve the content as a string (instead of a Selector instance), you would need to use something like .extract() or .extract_first():

>>> print(sel.xpath("//*[@id='Thumbnail-Image-Container']/li[1]/a//@href").extract_first())
http://images.esellerpro.com/2489/I/160/260/1/lrgIA01854-GALLERY.jpg
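
To make the difference between the printed preview and the extracted string concrete, here is a minimal sketch that runs without Scrapy installed. PreviewSelector is a hypothetical stand-in written for illustration, not Scrapy's real Selector class, and the 40-character preview length is an assumption chosen to mimic the shortened data field described above.

```python
class PreviewSelector(object):
    """Hypothetical stand-in for a Scrapy Selector (illustration only)."""

    # Assumption: the preview keeps only the first 40 characters.
    PREVIEW_LENGTH = 40

    def __init__(self, data):
        self._data = data

    def extract(self):
        # Return the complete underlying string, untruncated.
        return self._data

    def __repr__(self):
        # Only the printed representation is shortened; the stored
        # data is left untouched.
        return "<Selector data=%r>" % self._data[:self.PREVIEW_LENGTH]


url = "http://images.esellerpro.com/2489/I/160/260/1/lrgIA01854-GALLERY.jpg"
results = [PreviewSelector(url)]

preview = repr(results)        # what an interactive shell would echo
full = results[0].extract()    # the complete href as a plain string

print(preview)
print(full)
```

Echoing the list shows only the shortened preview, while extract() returns the whole URL. The same reasoning applies to the real Selector objects in the Scrapy shell: the data was complete all along, and only its display was abbreviated.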
