vipin thapliyal

Reputation: 21

Unable to extract the full URL from @href using Scrapy

I am trying to extract the URL of a product from amazon.in. The href attribute of the a tag in the page source looks like this:

href="/Parachute-Coconut-Oil-600-Free/dp/B081WSB91C/ref=sr_1_49?dchild=1&fpw=pantry&fst=as%3Aoff&qid=1588693187&s=pantry&sr=8-49&srs=9574332031&swrs=789D2F4EC1B25821250A55BFCB953F03"

What Scrapy is extracting is:

/Parachute-Coconut-Oil-Bottle-600ml/dp/B071FB2ZVT?dchild=1

I used the following xpath:

//div[@class="a-section a-spacing-none a-spacing-top-small"]//a[@class="a-link-normal a-text-normal"]/@href

This is the website I am trying to scrape:
https://www.amazon.in/s?i=pantry&srs=9574332031&bbn=9735693031&rh=n%3A9735693031&dc&page=2&fst=as%3Aoff&qid=1588056650&swrs=789D2F4EC1B25821250A55BFCB953F03&ref=sr_pg_2

How can I extract the expected url with Scrapy?

Upvotes: 2

Views: 575

Answers (1)

dram95

Reputation: 687

That is known as a relative URL. To get the full URL, you can join it with the base URL. I don't know what your code looks like, but try something like this:

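half_url = response.xpath('//div[@class="a-section a-spacing-none a-spacing-top-small"]//a[@class="a-link-normal a-text-normal"]/@href').get()
# response.urljoin resolves the relative href against the page URL;
# 'https://www.amazon.in/' + half_url would produce a double slash.
full_url = response.urljoin(half_url)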
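The joining step can also be tried outside a spider. Below is a minimal sketch using the standard library's urljoin; the helper name to_absolute and the shortened search-page URL are assumptions made for illustration, not part of the original code. urljoin takes care of the leading slash, so the result has no doubled "/" after the domain.

```python
from urllib.parse import urljoin


def to_absolute(page_url: str, href: str) -> str:
    """Resolve a (possibly relative) href against the URL of the page it came from."""
    return urljoin(page_url, href)


# Shortened stand-in for the search page URL from the question (assumption).
page_url = "https://www.amazon.in/s?i=pantry&page=2"
# The relative href that Scrapy returned in the question.
half_url = "/Parachute-Coconut-Oil-Bottle-600ml/dp/B071FB2ZVT?dchild=1"

full_url = to_absolute(page_url, half_url)
print(full_url)
```

An href that is already absolute is returned unchanged by urljoin, so the same helper can be applied to every extracted link.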

Upvotes: 1
