Reputation: 13
I have a list of 50,000 websites and I want to know which protocol each one uses. Every entry is a bare name such as names.com or something.com; none of them includes a scheme, as in http://google.com. I tried opening each one in turn and checking it like this:
require 'rubygems'
require 'open-uri'
require 'io/console'
require 'open_uri_redirections'
require 'openssl'
OpenSSL::SSL::VERIFY_PEER = OpenSSL::SSL::VERIFY_NONE
filename = "./testfile.txt"
destination = File.open("./11aa.txt", "a")
newArray = Array.new
newArray = IO.readlines(filename)
newArray.each do |url|
  begin
    puts "#{url}"
    if open(url,:read_timeout=>2 )
      destination.write "#{url}"
    end
  rescue => e
    puts e.message
  end
end
This did work, but it takes forever to finish. I am looking for a better way to do the check.
Thanks
Upvotes: 1
Views: 1607
Reputation: 13
require 'open-uri'
require 'open_uri_redirections' # provides the :allow_redirections option

def correct_url_protocol(single_url)
  puts "-----------------------In correct_url_protocol--------------------------"
  good_link = "http://www.#{single_url}"
  begin
    # open raises on failure, so reaching the next line means HTTP worked
    open(good_link, read_timeout: 3, allow_redirections: :all)
    good_link
  rescue => e
    puts e.message
    # "redirection forbidden" is the error raised when a redirect is refused
    good_link = "https://www.#{single_url}" if e.message.match("redirection forbidden")
    good_link
  end
end
I think this is the best approach I have come up with. Let me know if there is a better one.
Upvotes: 0
Reputation: 160551
"Protocol"? As in the IP protocol used to connect to a host as defined by the URL?
require 'uri'
URI.parse('http://foo.com').scheme # => "http"
URI.parse('https://foo.com').scheme # => "https"
URI.parse('ftp://foo.com').scheme # => "ftp"
URI.parse('scp://foo.com').scheme # => "scp"
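Since the entries in the question have no scheme at all, URI.parse will report nil for them. Here is a minimal sketch of how the list could be normalized before probing. The helper names scheme_of and candidate_urls are made up for illustration, and trying HTTPS before HTTP is just one possible ordering:

```ruby
require 'uri'

# Hypothetical helper: the scheme of an entry, or nil when it has none
# (or when the entry cannot be parsed as a URI at all).
def scheme_of(entry)
  URI.parse(entry.strip).scheme
rescue URI::InvalidURIError
  nil
end

# Hypothetical helper: the URLs worth trying for one line of the input file.
# An entry that already names a scheme is kept as it is.
def candidate_urls(entry)
  entry = entry.strip # IO.readlines keeps the trailing newline
  return [entry] if scheme_of(entry)

  ["https://#{entry}", "http://#{entry}"]
end
```

With this, scheme_of("google.com") is nil, while candidate_urls("google.com\n") gives both the https:// and http:// forms to try.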
If you want to know whether a site accepts HTTPS vs. HTTP, I'd start by checking for HTTPS, as the majority of sites allow HTTP:
require 'net/http'
%w[
  example.com
  www.example.com
  mail.google.com
  account.dyn.com
].each do |url|
  begin
    Net::HTTP.start(url, 443, :use_ssl => true) {}
    puts "#{url} is HTTPS"
  rescue
    puts "#{url} is HTTP"
  end
end
# >> example.com is HTTP
# >> www.example.com is HTTP
# >> mail.google.com is HTTPS
# >> account.dyn.com is HTTPS
Even though mail.google.com and account.dyn.com are HTTPS, if you test them for HTTP first, you'll see they also have that protocol. Some sites will redirect their HTTP request to their HTTPS server, others run both to allow a user to decide whether they want HTTP or HTTPS. You can test both protocols to figure out which cases are true.
Net::HTTP.start doesn't require a block, but when given one, even an empty one, it automatically closes the connection as soon as the block returns.
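To test both protocols rather than treating every HTTPS failure as "HTTP", something along these lines might work. It is only a sketch: accepts? and protocols_for are names I made up, and the 2-second timeouts and the default ports are assumptions you would tune:

```ruby
require 'net/http'

# Hypothetical helper: true if a connection to host:port can be opened.
# With use_ssl the TLS handshake must succeed as well.
def accepts?(host, port, use_ssl, timeout = 2)
  Net::HTTP.start(host, port, use_ssl: use_ssl,
                  open_timeout: timeout, read_timeout: timeout) {}
  true
rescue StandardError
  # refused connection, DNS failure, timeout, TLS error, ...
  false
end

# Hypothetical helper: probe the default HTTP and HTTPS ports of one host.
def protocols_for(host)
  { http: accepts?(host, 80, false), https: accepts?(host, 443, true) }
end
```

Calling protocols_for("example.com") returns a hash with one boolean per protocol, so a host that answers on both ports can be told apart from one that only answers on one of them.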
Sites don't necessarily run their web services on ports 80 and 443. As a result, assuming the connection should be to one of those ports isn't necessarily right and could give you bad results if they use a different one. 8080 and 8081 are also often used so those should be checked too.
Also, a site might respond on a port, but its response could be a redirect pointing you to the real port they want you to use. So you also need to decide whether you only care about the connection succeeding, or whether to look inside the HTTP headers, or actually read the entire page returned and parse it in case it's a software redirect.
In other words, a connection succeeding doesn't tell you enough about what the site wants you to use; you'll have to conduct additional tests too.
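As one example of such an additional test, the redirect case can be checked by issuing a HEAD request and looking at the Location header. This is a rough sketch: redirect_target is a hypothetical name, it only looks at the first response, and it does not handle redirects done in HTML or JavaScript:

```ruby
require 'net/http'

# Hypothetical helper: the Location header if host:port answers a plain
# HTTP HEAD request with a 3xx redirect, otherwise nil.
def redirect_target(host, port = 80, timeout = 2)
  Net::HTTP.start(host, port, open_timeout: timeout, read_timeout: timeout) do |http|
    response = http.head('/')
    response.is_a?(Net::HTTPRedirection) ? response['location'] : nil
  end
rescue StandardError
  # treat connection failures and timeouts as "no redirect seen"
  nil
end
```

If the returned string starts with https://, the site answers on HTTP but wants you on HTTPS. URI.parse(target).scheme will pull the scheme out of it.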
Upvotes: 1
Reputation: 2513
Which protocol do you care about most? Is HTTPS preferable over HTTP? Some sites have both, and some are redirects (http://www.google.com is a 302).
If you don't care which one it is, then go with HTTP first, as it's probably the more likely one to succeed, so you should waste considerably less time on failed calls.
Also, I'd drop the read_timeout down to 1 second or even 500ms. If a site doesn't respond within that time, it might as well be dead (we're talking about a simple response, not fully downloading all assets for the DOM).
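Short timeouts help, but with 50,000 hosts most of the time is still spent waiting on the network, so running several checks at once should help too. Here is a minimal sketch using plain threads and a Queue from the standard library. check_all is a made-up name, the default of 20 workers is an arbitrary assumption, and the block is whatever per-host check you settle on:

```ruby
# Hypothetical helper: run the given block once per host on a small pool
# of threads and collect the results in a hash keyed by host.
def check_all(hosts, workers: 20)
  queue = Queue.new
  hosts.each { |host| queue << host }
  queue.close # pop returns nil once the closed queue is empty

  results = {}
  lock = Mutex.new
  threads = Array.new(workers) do
    Thread.new do
      while (host = queue.pop)
        outcome = yield(host)
        lock.synchronize { results[host] = outcome }
      end
    end
  end
  threads.each(&:join)
  results
end
```

The block should rescue its own errors and return something like :http, :https or :dead; an exception that escapes the block would end that worker thread and be re-raised when it is joined.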
Upvotes: 0