Kalpana Regmi

Reputation: 11

Web Scraping using Scrapy

I am scraping the Flipkart website and want to extract the image URLs from it. This is the link to the website.

import scrapy
from ..items import FlipcartItem
class QuotesSpider(scrapy.Spider):
    name='quotes'
    start_urls=[
        'https://www.flipkart.com/clothing-and-accessories/topwear/pr?sid=clo%2Cash&otracker=categorytree&p%5B%5D=facets.ideal_for%255B%255D%3DMen'
        ]
    def parse(self,response):
        items=FlipcartItem()
        image_url=response.css('._2r_T1I img::attr(src)').extract()
        #product_page_url=response.css('').extract()
        items['image_url']=image_url
        #items['product_page']=title
        yield items

This is the code I have written, but when I run it I get a list of empty strings, like image_url ["","",""]. Can anyone suggest where I am going wrong?

Upvotes: 0

Views: 112

Answers (4)

bonifacio_kid

Reputation: 1053

This site generates its content with JavaScript. Use "View page source" and you will see that the image src attributes are empty. Nothing is wrong with your code. Use Selenium or Scrapy Splash: they execute the page's JavaScript for you so that you can scrape the data.
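
As a rough way to confirm this without opening a browser, the sketch below counts the img tags whose src is empty in a piece of raw HTML. It uses only the standard library. The class name EmptySrcCounter and the sample markup are made up for illustration; in a spider the input would be the unrendered response body instead.

```python
from html.parser import HTMLParser


class EmptySrcCounter(HTMLParser):
    """Counts <img> tags and how many of them have an empty or missing src."""

    def __init__(self):
        super().__init__()
        self.total = 0
        self.empty = 0

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        self.total += 1
        src = dict(attrs).get("src")
        if not src:  # covers both src="" and a missing src attribute
            self.empty += 1


# Hypothetical markup imitating what the server sends before JavaScript runs.
raw_html = """
<div><img class="_2r_T1I" src=""></div>
<div><img class="_2r_T1I" src=""></div>
<div><img class="logo" src="/static/logo.png"></div>
"""

counter = EmptySrcCounter()
counter.feed(raw_html)
print(counter.empty, "of", counter.total, "img tags have no src")
```

If most product images show up as empty here, the selector is probably fine and the missing piece is JavaScript rendering.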

Upvotes: 0

Samsul Islam

Reputation: 2609

This site uses JavaScript to load its images, so Scrapy on its own cannot access them. You need Selenium to extract the image data. Here I use Scrapy's Selector on the page source that Selenium returns. If you want to use Scrapy together with Selenium, follow this url, or use Scrapy Splash.

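from selenium import webdriver
from selenium.webdriver.firefox.service import Service
from scrapy.selector import Selector

# Selenium 4 takes the driver path through a Service object;
# Selenium 3 used webdriver.Firefox(executable_path='./geckodriver').
browser = webdriver.Firefox(service=Service(executable_path='./geckodriver'))
browser.get(url="https://www.flipkart.com/clothing-and-accessories/topwear/pr?sid=clo%2Cash&otracker=categorytree&p%5B%5D=facets.ideal_for%255B%255D%3DMen")

# page_source is the HTML after the browser has run the page's JavaScript
page = browser.page_source
browser.quit()

selector = Selector(text=page)
image_data = selector.css('img._2r_T1I::attr(src)').getall()
# print(selector.xpath('//div[@class="CXW8mj _21_khk"]/img/@src').get())

print(image_data)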

If you need to install selenium, please follow this url.

Upvotes: 1

Kalpana Regmi

Reputation: 11

I tried doing this:

import scrapy
class QuotesSpider(scrapy.Spider):
    name='quotes'
    start_urls=[
        'https://www.flipkart.com/clothing-and-accessories/topwear/pr?sid=clo%2Cash&otracker=categorytree&p%5B%5D=facets.ideal_for%255B%255D%3DMen'
        ]
    def parse(self,response):
        raw_image_urls=response.css('img._2r_T1I').xpath('@src').getall()
        clean_image_urls=[]
        for img_url in raw_image_urls:
            clean_image_urls.append(response.urljoin(img_url))
        yield {
            'image_urls': clean_image_urls
        }

But I am getting the URL of the main page, not the image URLs.
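
A likely explanation, sketched with the standard library: joining an empty src onto a base URL simply returns the base URL, so every empty string turns into the page address. The snippet assumes response.urljoin behaves like urllib.parse.urljoin in this case, and the example.com address is a placeholder, not the real page.

```python
from urllib.parse import urljoin

# Placeholder page address, standing in for response.url.
page_url = "https://example.com/topwear/pr?sid=clo"

# What the spider saw: empty src values in the unrendered HTML.
raw_image_urls = ["", "", ""]

# Joining an empty string onto the base yields the base itself.
joined = [urljoin(page_url, src) for src in raw_image_urls]
print(joined)

# Dropping empty values before joining avoids the misleading result.
clean_image_urls = [urljoin(page_url, src) for src in raw_image_urls if src]
print(clean_image_urls)
```

Filtering this way only hides the symptom. The src values are still empty until the page's JavaScript has run.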

Upvotes: 0

Robert Hafner

Reputation: 3875

You should consider changing this line:

image_url=response.css('._2r_T1I img::attr(src)').extract()

to this:

image_urls=response.css('img._2r_T1I').xpath('@src').getall()

Also be aware that your "image_url" will be a list even if there is only one match, because that is what Scrapy returns. You may want to iterate over the results and create a new FlipcartItem for each one, or, if you only expect one result, pull it out of the list.
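
A minimal sketch of the one-item-per-URL idea follows. The helper name items_from_urls is hypothetical, plain dicts stand in for FlipcartItem, and the URLs are made up, so treat it as an illustration and not as the spider's actual code.

```python
def items_from_urls(image_urls):
    """Yield one item per non-empty image URL instead of one item holding a list."""
    for url in image_urls:
        if url:  # skip empty src values
            yield {"image_url": url}


# Made-up URLs for illustration.
sample = ["https://example.com/a.jpg", "", "https://example.com/b.jpg"]
items = list(items_from_urls(sample))
print(items)
```

Inside parse, the same generator could be driven by the selector result, for example with yield from items_from_urls(image_urls).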

Upvotes: 0
