Reputation: 657
I am trying to scrape the results of the following page:
http://www.peekyou.com/work/autodesk/page=1
with page = 1, 2, 3, 4, ... and so on, as per the results. I have a PHP file that runs the crawler for different page numbers. The code (for a single page) is as follows:
```python
import sys
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.item import Item
from scrapy.http import Request
#from scrapy.crawler import CrawlerProcess

class DmozSpider(BaseSpider):
    name = "peekyou_crawler"
    start_urls = ["http://www.peekyou.com/work/autodesk/page=1"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        discovery = hxs.select('//div[@class="nextPage"]/table/tr[2]/td/a[contains(@title,"Next")]')
        print len(discovery)
        print "Starting the actual file"
        items = hxs.select('//div[@class="resultCell"]')
        count = 0
        for newsItem in items:
            print newsItem
            url = newsItem.select('h2/a/@href').extract()
            name = newsItem.select('h2/a/span/text()').extract()
            count = count + 1
            print count
            print url[0]
            print name[0]
            print "\n"
```
The Autodesk results have 18 pages. When I run the code to crawl all the pages, the crawler only gets data from page 2, not from all pages. Similarly, when I changed the company name to something else, it again scraped some pages and not the rest, even though I get an HTTP 200 response for every page. Also, on repeated runs it keeps scraping the same pages, but never all of them. Any idea what the error in my approach could be, or what I am missing?
Thanks in advance.
Upvotes: 1
Views: 2856
Reputation: 473873
I'll give you a starting point.
The page you're trying to crawl is loaded via AJAX, which is a problem for Scrapy: it cannot handle dynamic page content loaded via AJAX/XHR requests. For more info see:
Using your browser's developer tools, you can see an outgoing POST request fired after the page load. It goes to http://www.peekyou.com/work/autodesk/web_results/web_tag_search_checker.php.
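Before wiring that into a spider, you can replay the request by hand to confirm the endpoint behaves as described. A minimal sketch using the `requests` library (the id token is the one from the dev-tools capture and likely varies per search):

```python
# Minimal verification sketch: replay the XHR seen in dev tools
# and check that the response contains the result markup.
import requests

url = "http://www.peekyou.com/work/autodesk/web_results/web_tag_search_checker.php"
data = {'id': 'search_work_a10362ede5ed8ed5ff1191321978f12a',  # token from dev tools
        '_': ''}

response = requests.post(url, data=data)
print response.status_code   # expect 200
print response.text[:300]    # should contain the resultCell divs
```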
So, simulating this request in Scrapy should let you crawl the necessary data:
```python
from scrapy.http import FormRequest
from scrapy.item import Item, Field
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector


class DmozItem(Item):
    name = Field()
    link = Field()


class DmozSpider(BaseSpider):
    name = "peekyou_crawler"
    start_urls = [
        "http://www.peekyou.com/work/autodesk/page=%d" % i
        for i in xrange(1, 19)  # pages 1..18
    ]

    def parse(self, response):
        yield FormRequest(url="http://www.peekyou.com/work/autodesk/web_results/web_tag_search_checker.php",
                          formdata={'id': 'search_work_a10362ede5ed8ed5ff1191321978f12a',
                                    '_': ''},
                          method="POST",
                          callback=self.after_post)

    def after_post(self, response):
        hxs = HtmlXPathSelector(response)
        persons = hxs.select("//div[@class='resultCell']")
        for person in persons:
            item = DmozItem()
            item['name'] = person.select('.//h2/a/span/text()').extract()[0].strip()
            item['link'] = person.select('.//h2/a/@href').extract()[0].strip()
            yield item
```
It works, but it dumps only the first page. I'll leave it to you to work out how to get the other results.
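One direction to investigate, as an unverified sketch: it assumes each page=N response embeds its own search_work_<hash> token that the XHR endpoint expects, which you'd need to confirm in the raw page source.

```python
import re

from scrapy.http import FormRequest

# Unverified sketch -- a replacement for DmozSpider.parse above.
# Assumes each page=N response embeds its own search_work_<hash>
# token; check the raw HTML to confirm before relying on this.
def parse(self, response):
    match = re.search(r"search_work_[0-9a-f]{32}", response.body)
    if match:
        yield FormRequest(
            url="http://www.peekyou.com/work/autodesk/web_results/web_tag_search_checker.php",
            formdata={'id': match.group(0), '_': ''},
            callback=self.after_post)
```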
Hope that helps.
Upvotes: 1
Reputation: 142661
You can add more addresses:
```python
start_urls = [
    "http://www.peekyou.com/work/autodesk/page=1",
    "http://www.peekyou.com/work/autodesk/page=2",
    "http://www.peekyou.com/work/autodesk/page=3",
]
```
You can generate more addresses:
```python
start_urls = [
    "http://www.peekyou.com/work/autodesk/page=%d" % i
    for i in xrange(1, 19)  # pages 1..18
]
```
I think you should read about start_requests()
and how to generate the next URL from within the spider. But I can't help you much here, because I don't use Scrapy; I still use pure Python (and PyQuery) to create simple crawlers ;)
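For reference, a minimal start_requests() sketch using the old BaseSpider API from the code above (PagedSpider is a made-up name):

```python
from scrapy.http import Request
from scrapy.spider import BaseSpider

class PagedSpider(BaseSpider):  # hypothetical spider name
    name = "paged"

    # Yield one request per results page instead of listing start_urls.
    def start_requests(self):
        for i in xrange(1, 19):  # pages 1..18
            yield Request("http://www.peekyou.com/work/autodesk/page=%d" % i,
                          callback=self.parse)

    def parse(self, response):
        pass  # extract items here
```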
PS. Sometimes servers check your User-Agent and IP, and how fast you grab the next page, and then stop sending pages to you.
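In Scrapy, the usual knobs for that live in settings.py (a sketch with example values):

```python
# settings.py sketch -- example values, tune for the target site
USER_AGENT = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36'
DOWNLOAD_DELAY = 2       # seconds to wait between requests
CONCURRENT_REQUESTS = 1  # crawl politely; fast crawlers get blocked
```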
Upvotes: 1