n_i_c_k

Reputation: 1534

What is a regex to check to see if some text contains only URLs?

I'm trying to write a regular expression that checks whether some text contains only URLs and whitespace, and nothing else, so:

http://www.google.com http://www.stackoverflow.com

would match, but:

http://www.google.com and http://www.stackoverflow.com

would not match.

Is this possible?

Upvotes: 3

Views: 221

Answers (5)

SaidbakR

Reputation: 13544

This checks that the string consists only of URLs, with a single whitespace character as the only separator between them:

^(((mailto\:|(news|(ht|f)tp(s?))\://){1}\S+)\s)*((mailto\:|(news|(ht|f)tp(s?))\://){1}\S+)$

Upvotes: 0

the Tin Man

Reputation: 160551

Ruby already has a method to extract URLs, so that's a great starting place, rather than reinventing a working wheel:

require 'uri'

[
  'http://www.google.com http://www.stackoverflow.com',
  'http://www.google.com and http://www.stackoverflow.com'
].each do |url|
  print url
  if url.split.all? { |u| !URI.extract(u).empty? }
    puts " contains only URLs"
  else
    puts " doesn't contain only URLs"
  end
end

Running it prints:

http://www.google.com http://www.stackoverflow.com contains only URLs
http://www.google.com and http://www.stackoverflow.com doesn't contain only URLs

This doesn't support all the recognized URL schemes, but it is a starting point. You can specify which ones you want by passing an array of schemes to URI.extract. You can get the IANA's permanent list using:

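require 'open-uri'
require 'nokogiri'

# Kernel#open no longer accepts URLs as of Ruby 3.0; use URI.open from open-uri.
doc = Nokogiri::HTML(URI.open('http://www.iana.org/assignments/uri-schemes.html'))
# Take the text of the first cell in each row of the nested table, dropping the first row.
schemes = doc.at('table table').search('tr').map { |tr| tr.at('td').text }[1..-1]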
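
To illustrate passing a scheme list, here is a minimal sketch that restricts URI.extract to http and https and requires each whitespace-separated word to be exactly one extracted URL. The method name only_urls? and the default scheme list are assumptions made for this example, not part of the original answer. Note also that the Ruby documentation marks URI.extract as obsolete, so treat this as illustrative rather than definitive.

```ruby
require 'uri'

# Hypothetical helper: true when every whitespace-separated word in +text+
# is, in its entirety, a URL with one of the given schemes.
def only_urls?(text, schemes = %w[http https])
  words = text.split
  # An empty string has no URLs at all, so reject it explicitly
  # (otherwise all? would be vacuously true).
  return false if words.empty?

  # URI.extract returns the URLs it finds inside the word; requiring the
  # result to equal [word] rejects words that merely contain a URL.
  words.all? { |word| URI.extract(word, schemes) == [word] }
end

only_urls?('http://www.google.com http://www.stackoverflow.com')     # => true
only_urls?('http://www.google.com and http://www.stackoverflow.com') # => false
```

Comparing against [word] instead of checking for a non-empty result is a stricter choice than the code in the answer; drop it if partial matches are acceptable.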

Upvotes: 1

Casimir et Hippolyte

Reputation: 89557

You can use this regex (it only tests that each whitespace-separated token begins with http:// or https://):

/^(?:https?:\/\/\S++\s*+)++$/ =~ text
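
One thing to watch for in Ruby: ^ and $ match at line boundaries, not only at the start and end of the string, so multi-line input can match on a single line that happens to be all URLs. A minimal sketch of the same pattern with \A and \z instead follows. The constant HTTP_URLS_ONLY and the method only_http_urls? are names made up for this example.

```ruby
# Same pattern as above, but anchored to the whole string with \A and \z.
# The possessive quantifiers (++ and *+) are kept from the original.
HTTP_URLS_ONLY = /\A(?:https?:\/\/\S++\s*+)++\z/

# Hypothetical helper: true when +text+ is nothing but http/https URLs
# separated (and optionally followed) by whitespace.
def only_http_urls?(text)
  HTTP_URLS_ONLY.match?(text)
end

only_http_urls?('http://www.google.com http://www.stackoverflow.com')     # => true
only_http_urls?('http://www.google.com and http://www.stackoverflow.com') # => false
only_http_urls?("foo\nhttp://www.google.com")                             # => false
```

With the ^...$ version, the last input would match starting at the second line.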

Upvotes: 1

jomsk1e

Reputation: 3625

If you really want to use regex, please try this:

(?<protocol>\w+):\/\/(?<domain>[\w@][\w.:@]+)\/?[\w\.?=%&=\-@/$,]*

Split the string on whitespace, and check that each piece matches the regex above.
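
A rough sketch of that split-and-check approach in Ruby follows. The names URL_PATTERN and all_urls? are made up for this example. The \A and \z anchors are an added assumption so that each piece has to match in full, and the duplicate = in the last character class has been dropped.

```ruby
# The pattern from above, written with %r{} so the slashes need no escaping.
# \A and \z are an addition: without them a piece only has to contain a match.
URL_PATTERN = %r{\A(?<protocol>\w+)://(?<domain>[\w@][\w.:@]+)/?[\w.?=%&\-@/$,]*\z}

# Hypothetical helper: split on whitespace and require every piece to match.
def all_urls?(text)
  pieces = text.split
  # Reject empty input explicitly, since all? on an empty array is true.
  !pieces.empty? && pieces.all? { |piece| URL_PATTERN.match?(piece) }
end

all_urls?('http://www.google.com http://www.stackoverflow.com')     # => true
all_urls?('http://www.google.com and http://www.stackoverflow.com') # => false
```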

Hope it helps!

Upvotes: 0

Explosion Pills

Reputation: 191749

words.split.all? { |word| word.match(/^http:/) }

Upvotes: 0
