Reputation: 11
I am scraping the Flipkart website and I want to extract the image URLs from it. This is the link to the website.
import scrapy
from ..items import FlipcartItem


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = [
        'https://www.flipkart.com/clothing-and-accessories/topwear/pr?sid=clo%2Cash&otracker=categorytree&p%5B%5D=facets.ideal_for%255B%255D%3DMen'
    ]

    def parse(self, response):
        items = FlipcartItem()
        image_url = response.css('._2r_T1I img::attr(src)').extract()
        #product_page_url=response.css('').extract()
        items['image_url'] = image_url
        #items['product_page']=title
        yield items
This is the code I have written. When I run it, I get a list of empty strings, like image_url: ["", "", ""]. Can anyone please suggest where I am going wrong?
Upvotes: 0
Views: 112
Reputation: 1053
This site generates its content with JavaScript. Use "View page source" and you can see that the image src attributes are empty. Nothing is wrong with the code. Use Selenium or Scrapy Splash: they execute the page's JavaScript for you, so you can scrape the data.
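A quick way to check this without a browser is to parse the raw HTML and count how many matching img tags carry a non-empty src. The sketch below uses only the standard library. The sample markup and the ImgSrcCollector class are illustrative assumptions, not Flipkart's actual page source.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> that has a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        if self.css_class in classes:
            # A missing or valueless src is recorded as an empty string
            self.srcs.append(attr_map.get("src") or "")


# Hypothetical markup resembling a page whose images are filled in later by JavaScript
raw_html = '<div><img class="_2r_T1I" src=""><img class="_2r_T1I" src=""></div>'

collector = ImgSrcCollector("_2r_T1I")
collector.feed(raw_html)
non_empty = [src for src in collector.srcs if src]
print(len(collector.srcs), "images found,", len(non_empty), "with a non-empty src")
```

If every src comes back empty for the real page source, the selector is matching and the values are simply not present until the JavaScript runs.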
Upvotes: 0
Reputation: 2609
This site uses JavaScript to load its images, which Scrapy on its own cannot access. You need to use Selenium to extract the image data. Here I use Scrapy's Selector together with Selenium to extract it. If you want to use Scrapy with Selenium, follow this URL, or use Scrapy Splash.
from selenium import webdriver
from scrapy.selector import Selector

# Note: Selenium 4 deprecates executable_path in favour of a Service object
browser = webdriver.Firefox(executable_path='./geckodriver')
browser.get(url="https://www.flipkart.com/clothing-and-accessories/topwear/pr?sid=clo%2Cash&otracker=categorytree&p%5B%5D=facets.ideal_for%255B%255D%3DMen")
page = browser.page_source  # HTML after the browser has run the page's JavaScript
browser.quit()

# Parse the rendered HTML with Scrapy's Selector
selector = Selector(text=page)
image_data = selector.css('img._2r_T1I::attr(src)').extract()
# print(selector.xpath('//div[@class="CXW8mj _21_khk"]/img/@src').get())
print(image_data)
If you need to install selenium, please follow this url.
Upvotes: 1
Reputation: 11
I tried doing this:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = [
        'https://www.flipkart.com/clothing-and-accessories/topwear/pr?sid=clo%2Cash&otracker=categorytree&p%5B%5D=facets.ideal_for%255B%255D%3DMen'
    ]

    def parse(self, response):
        raw_image_urls = response.css('img._2r_T1I').xpath('@src').getall()
        clean_image_urls = []
        for img_url in raw_image_urls:
            clean_image_urls.append(response.urljoin(img_url))
        yield {
            'image_urls': clean_image_urls
        }
But I am getting the URL of the main page, not the image URLs.
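A likely explanation, assuming the src attributes are empty in the raw HTML: joining an empty string onto a base URL returns the base URL unchanged. As far as I know, response.urljoin resolves URLs the same way as urllib.parse.urljoin, so the effect can be sketched with the standard library alone. The shortened page URL and the image path below are hypothetical.

```python
from urllib.parse import urljoin

# Shortened stand-in for the listing page URL
page_url = "https://www.flipkart.com/clothing-and-accessories/topwear/pr?sid=clo%2Cash"

# An empty src resolves to the page URL itself
empty_result = urljoin(page_url, "")

# A real path resolves against the site root (hypothetical image path)
relative_result = urljoin(page_url, "/image/example.jpeg")

print(empty_result)
print(relative_result)
```

So a list full of the main page URL suggests that every extracted src was an empty string.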
Upvotes: 0
Reputation: 3875
You should consider changing this line:
image_url=response.css('._2r_T1I img::attr(src)').extract()
to this:
image_urls=response.css('img._2r_T1I').xpath('@src').getall()
Also be aware that your "image_url" is going to be a list even if there is only one match, as that is what Scrapy returns. You may want to iterate over the results and create a new FlipcartItem for each one, or, if you only expect one result, pull it out of the list.
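The iteration could look roughly like the sketch below. It uses plain dicts and a hypothetical helper, build_image_items, so that it runs without Scrapy. The sample URLs are made up and stand in for the list that getall() returns.

```python
def build_image_items(image_urls):
    """Yield one item per non-empty image URL (hypothetical helper)."""
    for url in image_urls:
        if not url:
            # Skip empty src values, e.g. images not yet filled in by JavaScript
            continue
        yield {"image_url": url}


# Stand-in for response.css('img._2r_T1I').xpath('@src').getall()
sample_urls = ["", "https://example.com/a.jpeg", "https://example.com/b.jpeg"]

items = list(build_image_items(sample_urls))
print(items)
```

Inside the spider, parse could use `yield from build_image_items(image_urls)`, building a FlipcartItem in place of the dict if you prefer typed items.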
Upvotes: 0