user14045512

Fastest way to check if a url exists

I am currently writing a program that needs to check a large number of candidate URLs for any that actually exist. By "exist" I mean that you can visit the URL and get actual content of some sort, not string parsing to check that it is in URL format.

The program generates a list of possible variants for a filename and then checks each one until it finds a URL that actually exists, so most of the URL remains the same. For example:

https://www.test.com/folder1/FILE.png
https://www.test.com/folder1/File.png
https://www.test.com/folder1/file.png
https://www.test.com/folder1/file1.png

That said, my code currently works fine. However, it takes about 2-4 seconds per URL check, and I don't know how to speed it up. Is there a faster or better way to validate URLs, or am I just out of luck?

This is my function to validate urls:

require "net/http"

# Returns true if a HEAD request to url_path answers 200 or 403.
def url_exist?(url_path)
  url = URI.parse(url_path)
  # Opens a new connection on every call.
  req = Net::HTTP.new(url.host, url.port)
  req.use_ssl = url.scheme == "https"
  res = req.request_head(url.path)

  res.code == "200" || res.code == "403"
end

Thank you for taking the time to read this and any help will be much appreciated.

Upvotes: 2

Views: 3482

Answers (1)

Stefan

Reputation: 114208

Your code creates a new connection for each URL, so every check pays for a fresh TCP and TLS handshake. It should be faster to send multiple requests over the same connection via HTTP keep-alive.

In Ruby, you can open such a connection via Net::HTTP.start, e.g.:

require 'net/http'

class URLChecker
  def initialize(base_url)
    uri = URI(base_url)
    Net::HTTP.start(uri.host, uri.port, use_ssl: uri.is_a?(URI::HTTPS)) do |http|
      @http = http
      yield self
    end
  end

  def exist?(path)
    res = @http.head(path)
    res.code == '200' || res.code == '403'
  end
end

URLChecker.new('https://stackoverflow.com') do |uc|
  p uc.exist?('/questions/tagged/ruby')   #=> true
  p uc.exist?('/questions/tagged/python') #=> true
  p uc.exist?('/questions/tagged/foobar') #=> false
end
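
To tie this back to the filename variants in the question, here is a rough sketch of how the same keep-alive connection could be used to stop at the first candidate that exists. The method names first_existing_path and first_existing_url are made up for illustration and are not part of Net::HTTP. The sketch also assumes that all candidates live on one host, that the base URL ends with a slash, and that the file names need no escaping.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: returns the first path in +paths+ that the server
# answers with 200 or 403, or nil if none does. +http+ can be any object
# that responds to #head(path), e.g. a started Net::HTTP session.
def first_existing_path(http, paths)
  paths.find do |path|
    res = http.head(path)
    res.code == '200' || res.code == '403'
  end
end

# Hypothetical wrapper: opens one connection to the host of +base_url+
# and checks every candidate file name over it, stopping at the first hit.
# Returns the full URL as a string, or nil if no candidate exists.
def first_existing_url(base_url, file_names)
  base = URI(base_url)
  Net::HTTP.start(base.host, base.port, use_ssl: base.is_a?(URI::HTTPS)) do |http|
    path = first_existing_path(http, file_names.map { |name| base.path + name })
    path && URI.join(base_url, path).to_s
  end
end

# Example call (performs real network requests):
#   first_existing_url('https://www.test.com/folder1/',
#                      %w[FILE.png File.png file.png file1.png])
```

Keeping the status check in a method that only needs an object with a head method makes it possible to try the logic against a stub without touching the network.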

Upvotes: 1
