Reputation: 501
The site I want to index is fairly big: 1.x million pages. I really just want a JSON file of all the URLs so I can run some operations on them (sorting, grouping, etc.).
The basic Anemone loop worked well:
require 'anemone'

Anemone.crawl("http://www.example.com/") do |anemone|
  anemone.on_every_page do |page|
    puts page.url
  end
end
But (because of the site size?) the terminal froze after a while. So I installed MongoDB and used the following:
require 'rubygems'
require 'anemone'
require 'mongo'
require 'json'

$stdout = File.new('sitemap.json', 'w')

Anemone.crawl("http://www.mybigexamplesite.com/") do |anemone|
  anemone.storage = Anemone::Storage.MongoDB
  anemone.on_every_page do |page|
    puts page.url
  end
end
It's running now, but I'll be very surprised if there's output in the JSON file when I get back in the morning. I've never used MongoDB before, and the part of the Anemone docs about using storage wasn't clear (to me, at least). Can anyone who's done this before give me some tips?
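To make the goal concrete, here's a rough sketch of what I'm hoping to end up with: collect the URLs into a plain Ruby array and write a real JSON array at the end, instead of redirecting $stdout. I haven't tested this at anywhere near a million URLs, so treat it as a sketch only.

require 'rubygems'
require 'anemone'
require 'mongo'
require 'json'

urls = []

Anemone.crawl("http://www.mybigexamplesite.com/") do |anemone|
  # Keep Anemone's page store in MongoDB instead of in memory.
  anemone.storage = Anemone::Storage.MongoDB
  anemone.on_every_page do |page|
    urls << page.url.to_s
  end
end

# Write a single JSON array once the crawl finishes.
File.open('sitemap.json', 'w') { |f| f.write(JSON.pretty_generate(urls)) }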
Upvotes: 0
Views: 1352
Reputation: 501
If anyone out there needs <= 100,000 URLs, the Ruby gem Spidr is a great way to go.
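For example, something along these lines worked for me to dump the URLs to a JSON file; the exact calls may differ between Spidr versions, so check the docs and treat this as a sketch:

require 'spidr'
require 'json'

urls = []

# Spidr.site stays within the host of the given URL.
Spidr.site('http://www.example.com/') do |spider|
  spider.every_url do |url|
    urls << url.to_s
  end
end

File.open('sitemap.json', 'w') { |f| f.write(JSON.pretty_generate(urls)) }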
Upvotes: 3
Reputation: 6123
This is probably not the answer you wanted to see, but I strongly advise against using Anemone, and perhaps Ruby for that matter, for crawling a million pages.
Anemone is not a maintained library and fails on many edge cases.
Ruby is not the fastest language, and Ruby (MRI) uses a global interpreter lock, which means you can't get truly parallel threads. I think your crawl will probably be too slow. For more information about threading, I suggest you check out the following links (there's a quick illustration after them).
http://ablogaboutcode.com/2012/02/06/the-ruby-global-interpreter-lock/
Does ruby have real multithreading?
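As a rough illustration of the GIL point (MRI-specific; actual timings will vary on your machine), CPU-bound work doesn't get faster just by adding threads:

require 'benchmark'

def busy_work
  100_000.times { |i| Math.sqrt(i) }
end

# Run the work four times in a row...
serial = Benchmark.realtime { 4.times { busy_work } }

# ...and the same work in four threads.
threaded = Benchmark.realtime do
  4.times.map { Thread.new { busy_work } }.each(&:join)
end

# On MRI both numbers come out roughly the same, because the GIL only
# lets one thread run Ruby code at a time. On JRuby the threaded
# version can actually run in parallel.
puts "serial:   #{serial.round(2)}s"
puts "threaded: #{threaded.round(2)}s"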
You can try running Anemone on Rubinius or JRuby, which are much faster, but I'm not sure how compatible they are.
I had some mild success going from Anemone to Nutch but your mileage may vary.
Upvotes: 2