Dhruv

Reputation: 13

How to check the correct URL protocol in Ruby?

I have a list of 50,000 websites and I want to know which protocol each one uses. Every entry is a bare name such as names.com or something.com; none of them includes the scheme, as in http://google.com. I tried running through each one and checking it manually, like this:

require 'rubygems'

require 'open-uri'
require 'io/console'
require 'open_uri_redirections'
require 'openssl'

OpenSSL::SSL::VERIFY_PEER = OpenSSL::SSL::VERIFY_NONE

filename = "./testfile.txt"
destination = File.open("./11aa.txt", "a")

newArray = Array.new
newArray = IO.readlines(filename)
newArray.each do |url|
  begin
    puts "#{url}"
    if open(url, :read_timeout => 2)
      destination.write "#{url}"
    end
  rescue => e
    puts e.message
  end
end

This works, but it takes forever to finish. I am looking for a better algorithm to do the check.

Thanks

Upvotes: 1

Views: 1607

Answers (3)

Dhruv

Reputation: 13

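require 'open-uri'
require 'open_uri_redirections' # gem that provides the :allow_redirections option

def correct_url_protocol(single_url)
  puts "-----------------------In correct_url_protocol--------------------------"

  good_link = "http://www.#{single_url}"
  begin
    # URI.open is the open-uri entry point on Ruby 2.5+; Kernel#open no longer
    # opens URLs on Ruby 3.0+. It raises on failure, so reaching the next line
    # means the HTTP URL answered.
    URI.open(good_link, read_timeout: 3, allow_redirections: :all)
    good_link
  rescue => e
    puts e.message
    # open-uri reports a blocked redirect as "redirection forbidden";
    # in that case fall back to the HTTPS form of the URL.
    good_link = "https://www.#{single_url}" if e.message.include?("redirection forbidden")
    good_link
  end
end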

This is the best approach I have come up with. Let me know if there is a better one.

Upvotes: 0

the Tin Man

Reputation: 160551

"Protocol"? As in the IP protocol used to connect to a host as defined by the URL?

require 'uri'

URI.parse('http://foo.com').scheme # => "http"
URI.parse('https://foo.com').scheme # => "https"
URI.parse('ftp://foo.com').scheme # => "ftp"
URI.parse('scp://foo.com').scheme # => "scp"
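
Since the entries in the question have no scheme at all, URI.parse has nothing to report for them. Here is a minimal sketch of normalizing such lines before testing them. The names SCHEME_PREFIX and with_scheme are made up for illustration, and defaulting to "http" is an assumption, not something the question specifies:

```ruby
require 'uri'

# Hypothetical pattern: something like "http://" or "ftp://" at the start of the string.
SCHEME_PREFIX = %r{\A[a-z][a-z0-9+.\-]*://}i

# Hypothetical helper: strips the trailing newline left by IO.readlines and
# prefixes a default scheme when the line does not already carry one.
def with_scheme(line, default = 'http')
  host = line.strip
  host.match?(SCHEME_PREFIX) ? host : "#{default}://#{host}"
end

URI.parse('google.com').scheme                 # => nil
with_scheme("google.com\n")                    # => "http://google.com"
URI.parse(with_scheme("google.com\n")).scheme  # => "http"
```

The explicit pattern is used instead of checking URI.parse(...).scheme because a bare "host:port" string can be parsed with the host taken as the scheme.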

If you want to know whether a site accepts HTTPS vs. HTTP, I'd start by checking for HTTPS, as the majority of sites allow HTTP:

require 'net/http'

%w[
  example.com
  www.example.com
  mail.google.com
  account.dyn.com
].each do |url|
  begin
    Net::HTTP.start(url, 443, :use_ssl => true) {}
    puts "#{url} is HTTPS"
  rescue
    puts "#{url} is HTTP"
  end
end
# >> example.com is HTTP
# >> www.example.com is HTTP
# >> mail.google.com is HTTPS
# >> account.dyn.com is HTTPS

Even though mail.google.com and account.dyn.com are HTTPS, if you test them for HTTP first, you'll see they also have that protocol. Some sites will redirect their HTTP request to their HTTPS server, others run both to allow a user to decide whether they want HTTP or HTTPS. You can test both protocols to figure out which cases are true.
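
Testing both protocols could look roughly like the sketch below. The helper names (accepts?, classify, protocols_for) and the 3-second timeouts are my own assumptions, and ports 80 and 443 are taken for granted here:

```ruby
require 'net/http'

# Hypothetical helper: true when the connection (plus the TLS handshake
# when use_ssl is true) succeeds, false on any error or timeout.
def accepts?(host, port, use_ssl)
  Net::HTTP.start(host, port, use_ssl: use_ssl, open_timeout: 3, read_timeout: 3) {}
  true
rescue StandardError
  false
end

# Pure helper: turns the two results into a label.
def classify(http_ok, https_ok)
  if http_ok && https_ok
    :both
  elsif https_ok
    :https_only
  elsif http_ok
    :http_only
  else
    :unreachable
  end
end

# Hypothetical entry point: checks port 80 without TLS and port 443 with TLS.
def protocols_for(host)
  classify(accepts?(host, 80, false), accepts?(host, 443, true))
end
```

Calling protocols_for with a hostname returns one of the four symbols. It only says the connection succeeded; it does not say which protocol the site prefers.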

start doesn't require a block, but by providing an empty one it will automatically close the connection immediately after establishing it.

Sites don't necessarily run their web services on ports 80 and 443. As a result, assuming the connection should be to one of those ports isn't necessarily right and could give you bad results if they use a different one. 8080 and 8081 are also often used so those should be checked too.
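
A rough sketch of probing several candidate ports with a plain TCP connect. CANDIDATE_PORTS, port_open? and open_ports are names invented for this example, and the 2-second timeout is an assumption. It only reports that something is listening, not whether that port speaks HTTP or HTTPS:

```ruby
require 'socket'

# The ports mentioned above; extend the list as needed.
CANDIDATE_PORTS = [80, 443, 8080, 8081].freeze

# Hypothetical helper: true when a TCP connection to host:port can be
# opened within the timeout, false on refusal, DNS failure, or timeout.
def port_open?(host, port, timeout = 2)
  Socket.tcp(host, port, connect_timeout: timeout) { true }
rescue StandardError
  false
end

# Hypothetical helper: the subset of ports that accepted a connection.
def open_ports(host, ports = CANDIDATE_PORTS)
  ports.select { |port| port_open?(host, port) }
end
```

Each open port would then still need an HTTP or TLS test to find out what it serves.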

Also, a site might respond on a port, but its content could be a redirect pointing you to the real port they want you to use. So you also need to consider whether you only care about the connection succeeding, or whether you should look inside the HTTP response headers, or actually read the entire page returned and parse it in case it's a software redirect.

In other words, a connection succeeding doesn't tell you enough about what the site wants you to use; you'll have to conduct additional tests too.
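
As one example of such an additional test, here is a sketch that looks at the Location header of a HEAD request on port 80. The helper names redirect_scheme and preferred_scheme are hypothetical, the timeouts are assumptions, and it does not cover redirects done in the page body:

```ruby
require 'net/http'
require 'uri'

# Pure helper: the scheme a 3xx response redirects to, or nil when the
# response is not a redirect or the Location is relative or unparseable.
def redirect_scheme(status, location)
  return nil unless (300..399).cover?(status.to_i) && location

  URI.parse(location).scheme
rescue URI::InvalidURIError
  nil
end

# Hypothetical entry point: asks for "/" over plain HTTP and reports
# "https" when the site redirects there, "http" when it answers directly,
# and nil when the request fails.
def preferred_scheme(host)
  response = Net::HTTP.start(host, 80, open_timeout: 3, read_timeout: 3) do |http|
    http.head('/')
  end
  redirect_scheme(response.code, response['location']) || 'http'
rescue StandardError
  nil
end
```

A site that only listens on 443 would come back as nil here, so this complements the HTTPS connection test rather than replacing it.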

Upvotes: 1

Nate Fox

Reputation: 2513

Which protocol do you care about most? Is HTTPS preferable over HTTP? Some sites have both, and some are redirects (http://www.google.com is a 302).

If you don't care which one it is, then go with HTTP first, as it's probably more likely to be available, so calls to it should be considerably faster.

Also, I'd drop the read_timeout down to 1 second or even 500 ms. If a site doesn't respond within that time, it might as well be dead (we're talking about a simple response, not fully downloading all assets for the DOM).
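
A sketch of what the shorter timeout might look like with Net::HTTP, whose timeouts accept fractional seconds. The name responds_within? is made up here, and using a HEAD request for "/" is my assumption about what counts as a simple response:

```ruby
require 'net/http'

# Hypothetical helper: true when the host answers a HEAD request for "/"
# with both the connect and read timeouts set to `seconds`, else false.
def responds_within?(host, seconds, port: 80, use_ssl: false)
  Net::HTTP.start(host, port, use_ssl: use_ssl,
                              open_timeout: seconds, read_timeout: seconds) do |http|
    http.head('/')
  end
  true
rescue StandardError
  false
end
```

For example, responds_within?('example.com', 0.5) gives up after roughly half a second per attempt instead of waiting for the default timeouts.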

Upvotes: 0
