Philipp Meissner

Reputation: 5482

Use sidekiq with a running dynamic counter in Rails

I am building a website crawler whose collected links are used (later on) to read out information.

The current rake task goes through all the possible pages one by one and checks whether each request goes through (valid response) or returns a 404/503 error (invalid page). If the page is valid, its URL gets saved into my database. As you can see, the task requests 50,000 pages in total and therefore takes some time.

I have read about Sidekiq and how it can perform such tasks asynchronously, which should make this a lot faster.

My question: As you can see, my task increments the counter after each loop iteration. I guess this will not work with Sidekiq, since it would run the script several times, each run independent of the others. Am I right?

How would I get around the problem of each instance needing its own counter, then?

Hopefully my question makes sense - Thank you very much!

desc "Validate Pages"
task validate_url: :environment do
  require 'rubygems'
  require 'open-uri'
  require 'nokogiri'

  counter = 1
  base_url = "http://example.net/file"
  until counter > 50000 do
    begin
      url = "#{base_url}_#{counter}/"

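      # Raises OpenURI::HTTPError on 404/503, so the lines below only run for valid pages.
      # URI.open is the open-uri entry point; plain open(url) no longer fetches URLs on Ruby 3.0+.
      URI.open(url)
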
      page = Page.new
      page.url = url
      page.save!

      puts "Saved #{url} !"

      counter += 1

    rescue OpenURI::HTTPError => ex
      logger ||= Logger.new("validations.log")
      if ex.io.status[0] == "503"
        logger.info "#{ex} @ #{counter}"
      end

      puts "#{ex} @ #{counter}"
      counter += 1

    rescue SocketError => ex
      logger ||= Logger.new("validations.log")
      logger.info "#{ex} @ #{counter}"

      puts "#{ex} @ #{counter}"

      counter += 1
    end
  end
end

Upvotes: 0

Views: 210

Answers (2)

Mike Perham

Reputation: 22228

A simple Redis INCR operation will create and/or increment a global counter for your jobs to use. You can use Sidekiq's redis connection to implement a counter trivially:

Sidekiq.redis do |conn|
  conn.incr("my-counter")
end
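
To illustrate how concurrent jobs could each claim a distinct page number this way, here is a minimal sketch. It assumes only that the connection responds to `incr(key)` like a Redis connection does. `InMemoryCounterStore`, `next_page_url`, and `MAX_PAGE` are hypothetical names introduced for illustration, and the in-memory store merely stands in for Redis so the sketch runs without a server; threads stand in for Sidekiq jobs.

```ruby
# Hypothetical stand-in for a Redis connection, used only so this sketch
# runs without a Redis server. Inside Sidekiq you would instead use the
# object yielded by `Sidekiq.redis { |conn| ... }`.
class InMemoryCounterStore
  def initialize
    @values = Hash.new(0)
    @lock = Mutex.new
  end

  # Mirrors Redis INCR: a missing key starts at 0, 1 is added,
  # and the new value is returned as one atomic step.
  def incr(key)
    @lock.synchronize { @values[key] += 1 }
  end
end

BASE_URL = "http://example.net/file"
MAX_PAGE = 50_000

# Claims the next page number and returns its URL,
# or nil once every page has been handed out.
def next_page_url(conn, key = "my-counter")
  number = conn.incr(key)
  return nil if number > MAX_PAGE

  "#{BASE_URL}_#{number}/"
end

# Four threads stand in for four concurrent jobs, each claiming 25 pages.
store = InMemoryCounterStore.new
claimed = Queue.new
threads = 4.times.map do
  Thread.new do
    25.times { claimed << next_page_url(store) }
  end
end
threads.each(&:join)

# Drain the queue into an array; no URL should appear twice.
urls = Array.new(claimed.size) { claimed.pop }
```

Because each call to `incr` returns a value no other caller receives, the jobs never need to coordinate with each other beyond sharing the key name.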

Upvotes: 1

Avdept

Reputation: 2289

If you want to run it asynchronously, you will have many instances of the same job. The fastest approach is to use something like Redis, which gives you a simple and fast way to check and update the counter. Also make sure updates to the counter are safe under concurrency: while one job is using it, lock it for the other jobs, otherwise you will get wrong results.
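
One way to sidestep the locking concern altogether would be to avoid a shared counter and pass each job its page number as an argument. The following is a minimal sketch under that assumption, not a definitive implementation: `PageCheckJob` and `url_for` are hypothetical names, and the class is plain Ruby so it runs without the Sidekiq gem. In a Rails app it would `include Sidekiq::Worker` and be enqueued with `PageCheckJob.perform_async(page_number)`.

```ruby
require "net/http"
require "uri"

# Hypothetical job class. With Sidekiq installed, add `include Sidekiq::Worker`
# and enqueue with `PageCheckJob.perform_async(page_number)`.
class PageCheckJob
  BASE_URL = "http://example.net/file"

  # Builds the URL for one page number, matching the original rake task.
  def self.url_for(page_number)
    "#{BASE_URL}_#{page_number}/"
  end

  # Checks a single page. The page number arrives as an argument,
  # so jobs share no state and need no lock.
  def perform(page_number)
    url = self.class.url_for(page_number)
    response = Net::HTTP.get_response(URI(url))
    # In the Rails app this is where the Page record would be saved.
    response.is_a?(Net::HTTPSuccess) ? url : nil
  end
end

# Enqueuing side: the loop counter lives only here, in one process.
# With Sidekiq this would be:
#   page_numbers.each { |n| PageCheckJob.perform_async(n) }
page_numbers = (1..50_000)
first_urls = page_numbers.first(3).map { |n| PageCheckJob.url_for(n) }
```

With this layout the counter stays in the single process that enqueues the jobs, and a shared Redis counter is only needed if the jobs themselves must decide which page to fetch next.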

Upvotes: 0
