Reputation: 501
The site I want to index is fairly big: 1.x million pages. I really just want a JSON file of all the URLs so I can run some operations on them (sorting, grouping, etc.).
The basic Anemone loop worked well:
require 'anemone'

Anemone.crawl("http://www.example.com/") do |anemone|
  anemone.on_every_page do |page|
    puts page.url
  end
end
But (because of the site size?) the terminal froze after a while. So I installed MongoDB and used the following:
require 'rubygems'
require 'anemone'
require 'mongo'
require 'json'

$stdout = File.new('sitemap.json', 'w')

Anemone.crawl("http://www.mybigexamplesite.com/") do |anemone|
  anemone.storage = Anemone::Storage.MongoDB
  anemone.on_every_page do |page|
    puts page.url
  end
end
It's running now, but I'll be very surprised if there's output in the JSON file when I get back in the morning. I've never used MongoDB before, and the part of the Anemone docs about using storage wasn't clear (to me, at least). Can anyone who's done this before give me some tips?
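To make the goal concrete, here's a rough sketch of what I'm hoping to end up with: collect the URLs into a plain Ruby array and write a real JSON array at the end, instead of redirecting $stdout. I haven't tested this at anywhere near a million URLs, so treat it as a sketch only.

require 'rubygems'
require 'anemone'
require 'mongo'
require 'json'

urls = []

Anemone.crawl("http://www.mybigexamplesite.com/") do |anemone|
  # Keep Anemone's page store in MongoDB instead of in memory.
  anemone.storage = Anemone::Storage.MongoDB
  anemone.on_every_page do |page|
    urls << page.url.to_s
  end
end

# Write a single JSON array once the crawl finishes.
File.open('sitemap.json', 'w') { |f| f.write(JSON.pretty_generate(urls)) }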
Upvotes: 0
Views: 1352
Reputation: 501
If anyone out there needs <= 100,000 URLs, the Ruby gem Spidr is a great way to go.
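For example, something along these lines worked for me to dump the URLs to a JSON file; the exact calls may differ between Spidr versions, so check the docs and treat this as a sketch:

require 'spidr'
require 'json'

urls = []

# Spidr.site stays within the host of the given URL.
Spidr.site('http://www.example.com/') do |spider|
  spider.every_url do |url|
    urls << url.to_s
  end
end

File.open('sitemap.json', 'w') { |f| f.write(JSON.pretty_generate(urls)) }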
Upvotes: 3
Reputation: 6123
This is probably not the answer you wanted to see, but I strongly advise against using Anemone, and perhaps Ruby for that matter, for crawling a million pages.
Anemone is not a maintained library and fails on many edge cases.
Ruby is not the fastest language, and Ruby (MRI) uses a global interpreter lock, which means you can't get truly parallel threads. I think your crawl will probably be too slow. For more information about threading, I suggest you check out the following links (there's a quick illustration after them).
http://ablogaboutcode.com/2012/02/06/the-ruby-global-interpreter-lock/
Does ruby have real multithreading?
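As a rough illustration of the GIL point (MRI-specific; actual timings will vary on your machine), CPU-bound work doesn't get faster just by adding threads:

require 'benchmark'

def busy_work
  100_000.times { |i| Math.sqrt(i) }
end

# Run the work four times in a row...
serial = Benchmark.realtime { 4.times { busy_work } }

# ...and the same work in four threads.
threaded = Benchmark.realtime do
  4.times.map { Thread.new { busy_work } }.each(&:join)
end

# On MRI both numbers come out roughly the same, because the GIL only
# lets one thread run Ruby code at a time. On JRuby the threaded
# version can actually run in parallel.
puts "serial:   #{serial.round(2)}s"
puts "threaded: #{threaded.round(2)}s"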
You can try running Anemone on Rubinius or JRuby, which are much faster, but I'm not sure how compatible they are.
I had some mild success going from Anemone to Nutch but your mileage may vary.
Upvotes: 2