Reputation: 364
I am trying to use Scrapy to scrape this website.
First of all, here is my code:
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.settings import Settings
from scrapy.http import Request

#query = raw_input("Enter a product to search for= ")
query = 'table'
query1 = query.replace(" ", "+")

class DmozItem(scrapy.Item):
    productname = scrapy.Field()
    product_link = scrapy.Field()
    current_price = scrapy.Field()
    mrp = scrapy.Field()
    offer = scrapy.Field()
    imageurl = scrapy.Field()
    outofstock_status = scrapy.Field()

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    # allowed_domains takes bare domain names, not URLs
    allowed_domains = ["pepperfry.com"]

    def start_requests(self):
        # The search results are loaded via XHR, so request the XHR URL
        # directly for the first 10 result pages.
        task_urls = []
        for i in range(1, 11):
            task_urls.append("http://www.pepperfry.com/site_product/search?is_search=true&p=" + str(i) + "&q=" + query1)
        return [Request(url=task_url) for task_url in task_urls]

    def parse(self, response):
        print response
        items = []
        for sel in response.xpath('//html/body/div[2]/div[2]/div[2]/div[4]/div'):
            item = DmozItem()
            item['productname'] = str(sel.xpath('div[1]/a/img/@alt').extract())[3:-2]
            item['product_link'] = str(sel.xpath('div[2]/a/@href').extract())[3:-2]
            item['current_price'] = str(sel.xpath('div[3]/div/span[2]/span/text()').extract())[3:-2]
            # xpath() does not raise when nothing matches, so check for an
            # empty result instead of wrapping the call in try/except.
            mrp = sel.xpath('div[3]/div/span[1]/p/span/text()').extract()
            if mrp:
                item['mrp'] = str(mrp)[3:-2]
            else:
                item['mrp'] = item['current_price']
            item['offer'] = 'No additional offer available'
            item['imageurl'] = str(sel.xpath('div[1]/a//img/@src').extract())[3:-2]
            item['outofstock_status'] = 'In Stock'
            items.append(item)
        print items

settings = Settings()
settings.set("BOT_NAME", "dmoz")  # "PROJECT" is not a Scrapy setting; BOT_NAME is the closest equivalent
settings.set("CONCURRENT_REQUESTS", 100)
settings.set("DEPTH_PRIORITY", 1)
settings.set("SCHEDULER_DISK_QUEUE", "scrapy.squeues.PickleFifoDiskQueue")
settings.set("SCHEDULER_MEMORY_QUEUE", "scrapy.squeues.FifoMemoryQueue")

crawler = CrawlerProcess(settings)
crawler.crawl(DmozSpider)  # pass the spider class, not an instance
crawler.start()
The website uses XHR to load the products, which I have figured out correctly (you can see the XHR URL being built in the task_urls list in my code), and that part is working. The next issue is that the website also loads its images via AJAX / JavaScript (I am not sure which of the two this site uses). So if you execute my script, you'll notice that a loading-placeholder image gets scraped instead of the actual product image.
How do I send requests to the page to load the images (since the images aren't loaded via XHR) before I start scraping, so that I can scrape all the images?
Please give me a valid, working solution, specifically for my code. Thanks! :)
Upvotes: 1
Views: 775
Reputation: 3691
If I look at the source of the site under one of your task_urls (let's say str(i) evaluates to 2), I can see the images in the source code; however, the image URLs are not in the src attribute of the img tag but in the data-src attribute.
If I let a simple spider go over it, I get the URLs of the images:
for i in response.xpath("//a/img[1]"):
    print i.xpath("./@data-src").extract()
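For reference, here is a minimal standalone spider along those lines (the page-2 search URL and the XPath come from the question and the snippet above; the spider name and the default settings are my own placeholder choices):
import scrapy
from scrapy.crawler import CrawlerProcess

class ImageUrlSpider(scrapy.Spider):
    # Throwaway spider that only prints the image URLs from one result page.
    name = "imageurls"
    start_urls = [
        "http://www.pepperfry.com/site_product/search?is_search=true&p=2&q=table",
    ]

    def parse(self, response):
        # The real image URL sits in data-src; src only carries the
        # loading placeholder the question mentions.
        for img in response.xpath("//a/img[1]"):
            print img.xpath("./@data-src").extract()

process = CrawlerProcess()
process.crawl(ImageUrlSpider)
process.start()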
So try changing your XPath expression from src to data-src. Changing this line gives the correct result:
item['imageurl'] = str(sel.xpath('div[1]/a//img/@data-src').extract())[3:-2]
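If some products ever carry the final URL directly in src (a common variation with lazy loading; I have not verified whether this site does), a safe pattern is to prefer data-src and fall back to src:
# Assumption: prefer data-src, fall back to src for images that are
# not lazy-loaded. Not verified against the live site.
image = sel.xpath('div[1]/a//img/@data-src').extract()
if not image:
    image = sel.xpath('div[1]/a//img/@src').extract()
item['imageurl'] = str(image)[3:-2]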
Upvotes: 2