Reputation: 95980

Regex to match URL where scheme is optional (without "http")

I am using the following regex to match a URL:

$search  = "/([\S]+\.(MUSEUM|TRAVEL|AERO|ARPA|ASIA|COOP|INFO|NAME|BIZ|CAT|COM|INT|JOBS|NET|ORG|PRO|TEL|AC|AD|AE|AF|AG|AI|AL|AM|AN|AO|AQ|AR|AS|AT|AU|au|AW|AX|AZ|BA|BB|BD|BE|BF|BG|BH|BI|BJ|BL|BM|BN|BO|BR|BS|BT|BV|BW|BY|BZ|CA|CC|CD|CF|CG|CH|CI|CK|CL|CM|CN|CO|CR|CU|CV|CX|CY|CZ|DE|DJ|DK|DM|DO|DZ|EC|EDU|EE|EG|EH|ER|ES|ET|EU|FI|FJ|FK|FM|FO|FR|GA|GB|GD|GE|GF|GG|GH|GI|GL|GM|GN|GOV|GP|GQ|GR|GS|GT|GU|GW|GY|HK|HM|HN|HR|HT|HU|ID|IE|IL|IM|IN|IO|IQ|IR|IS|IT|JE|JM|JO|JP|KE|KG|KH|KI|KM|KN|KP|KR|KW|KY|KZ|LA|LB|LC|LI|LK|LR|LS|LT|LU|LV|LY|MA|MC|MD|ME|MF|MG|MH|MIL|MK|ML|MM|MN|MO|MOBI|MP|MQ|MR|MS|MT|MU|MV|MW|MX|MY|MZ|NA|NC|NE|NF|NG|NI|NL|NO|NP|NR|NU|NZ|OM|PA|PE|PF|PG|PH|PK|PL|PM|PN|PR|PS|PT|PW|PY|QA|RE|RO|RS|RU|RW|SA|SB|SC|SD|SE|SG|SH|SI|SJ|SK|SL|SM|SN|SO|SR|ST|SU|SV|SY|SZ|TC|TD|TF|TG|TH|TJ|TK|TL|TM|TN|TO|R|H|TP|TR|TT|TV|TW|TZ|UA|UG|UK|UM|US|UY|UZ|VA|VC|VE|VG|VI|VN|VU|WF|WS|YE|YT|YU|ZA|ZM|ZW)([\S]*))/i";

But its a bit screwed up because it also matches "example.php" which I dont want. and something like abc...test. I want it to match example.com though. and www.example.com as well as http://example.com.

It just needs a slight tweak at the end but I am not sure what. (there should be a slash after the any domain name which it is not checking for right now and it is only checking \S)

thank you for your time.

Upvotes: 27

Answers (14)

Miko Trueman

Reputation: 61

This will get any url in its entirety, including ?= and #/ if they exist:

/[A-Za-z]+:\/\/[A-Za-z0-9\-_]+\.[A-Za-z0-9\-_:%&;\?\#\/.=]+/g

Upvotes: 6

keshav gaur

Reputation: 1

[ftp:\/\/www\/.-https:\/\/-http:\/\/][a-zA-Z0-9u00a1-uffff0]{1,3}[^ ]{1,1000}

This works fine for me in js

var regex = new RegExp('[ftp:\/\/www\/.-https:\/\/-http:\/\/][a-zA-Z0-9u00a1-uffff0]{1,3}[^ ]{1,1000}');
regex.exec('https://www.youtube.com/watch?v=FM7MFYoylVs&feature=youtu.be&t=20s');

Upvotes: -1

pragma

Reputation: 1400

Try Regexy::Web::Url

r = Regexy::Web::Url.new # matches 'http://foo.com', 'www.foo.com' and 'foo.com'

Upvotes: 0

aminhotob

Reputation: 1066

I think this is simple and efficient /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/

Upvotes: 1

B Seven

Reputation: 45941

This question was surprisingly difficult to find an answer for. The regexes I found were too complicated to understand, and anything more that a regex is overkill and too difficult to implement.

Finally came up with:

/(\S+\.(com|net|org|edu|gov)(\/\S+)?)/

Works with http://example.com, https://example.com, example.com, http://example.com/foo.

Explanation:

Looks for .com, etc.
Matches everything before it up to the space
Matches everything after it up to the space

Upvotes: 8

Marco

Reputation: 1

This is THE ONE:

_^(?:(?:https?|ftp)://)(?:\S+(?::\S*)?@)?(?:(?!10(?:\.\d{1,3}){3})(?!127(?:\.\d{1,3}){3})(?!169\.254(?:\.\d{1,3}){2})(?!192\.168(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\x{00a1}-\x{ffff}0-9]+-?)*[a-z\x{00a1}-\x{ffff}0-9]+)(?:\.(?:[a-z\x{00a1}-\x{ffff}0-9]+-?)*[a-z\x{00a1}-\x{ffff}0-9]+)*(?:\.(?:[a-z\x{00a1}-\x{ffff}]{2,})))(?::\d{2,5})?(?:/[^\s]*)?$_iuS

Upvotes: 0

Jerry

Reputation: 1

Regex to match all urls (with www, without www, with http or https, without http or https, includes all 2-6 letter top level domain names [for countries, ex 'ly','us'], ports, query strings, and anchors ['#']). It's not 100% but it is better than anything I have seen posted on the web.

It uses the top level domains from the first answer, combined with other techniques found in my searches. It will return any valid url that has bounds, that is where \b comes into play. Since the trailing '/' is also triggered by \b, the last one, is a match for one or more '?'.

/\b((http(s?):\/\/)?([a-z0-9\-]+\.)+(MUSEUM|TRAVEL|AERO|ARPA|ASIA|EDU|GOV|MIL|MOBI|COOP|INFO|NAME|BIZ|CAT|COM|INT|JOBS|NET|ORG|PRO|TEL|A[CDEFGILMNOQRSTUWXZ]|B[ABDEFGHIJLMNORSTVWYZ]|C[ACDFGHIKLMNORUVXYZ]|D[EJKMOZ]|E[CEGHRSTU]|F[IJKMOR]|G[ABDEFGHILMNPQRSTUWY]|H[KMNRTU]|I[DELMNOQRST]|J[EMOP]|K[EGHIMNPRWYZ]|L[ABCIKRSTUVY]|M[ACDEFGHKLMNOPQRSTUVWXYZ]|N[ACEFGILOPRUZ]|OM|P[AEFGHKLMNRSTWY]|QA|R[EOSUW]|S[ABCDEGHIJKLMNORTUVYZ]|T[CDFGHJKLMNOPRTVWZ]|U[AGKMSYZ]|V[ACEGINU]|W[FS]|Y[ETU]|Z[AMW])(:[0-9]{1,5})?((\/([a-z0-9_\-\.~]*)*)?((\/)?\?[a-z0-9+_\-\.%=&amp;]*)?)?(#[a-zA-Z0-9!$&'()*+.=-_~:@/?]*)?)/gi

Upvotes: 0

SpYk3HH

Reputation: 22580

Just to add to things. I know this doesn't fully and directly answer this specific question, but it's the best place I can find to add this info. I wrote a jQuery plug a while back to match urls for similar purpose, however at current state (will update it as time goes on) it will still consider addresses like 'http://abc.php' as valid. However, if there is no http, https, or ftp at url start, it will not return 'valid'. Though I should clarify, this jQuery method returns an object and not just one string or boolean. The object breaks things down and among the breakdown is a .valid boolean. See the full fiddle and test in the link at bottom. If you simply wanna grab the plugin and go, see below:

jQuery Plugin

(function($){$.matchUrl||$.extend({matchUrl:function(c){var b=void 0,d="url,,scheme,,authority,path,,query,,fragment".split(","),e=/^(([^\:\/\?\#]+)\:)?(\/\/([^\/\?\#]*))?([^\?\#]*)(\?([^\#]*))?(\#(.*))?/,a={url:void 0,scheme:void 0,authority:void 0,path:void 0,query:void 0,fragment:void 0,valid:!1};"string"===typeof c&&""!=c&&(b=c.match(e));if("object"===typeof b)for(x in b)d[x]&&""!=d[x]&&(a[d[x]]=b[x]);a.scheme&&a.authority&&(a.valid=!0);return a}});})(jQuery);

jsFiddle with example:

http://jsfiddle.net/SpYk3/e4Ank/

Upvotes: -2

benophobia

Reputation: 771

Changing the end of the regex to (/\S*)?)$ should solve your problem.

To explain what that is doing -

it is looking for / followed by some characters (not whitespace)
this match is optional, ? indicated 0 or 1 times
and finally it should be followed by a end of string (or change it to \b for matching on a word boundary).

Upvotes: 1

Nibalkar

Reputation: 51

(http|www)\S+

Just use this regex to match all url's

Upvotes: -3

Diego Perini

Reputation: 129

Not exactly what the OP asked for but this is a much simpler regular expression that does not need to be updated each time the IANA introduces a new TLD. I believe this is more adequate for most simple needs:

^(?:https?://)?(?:[\w]+\.)(?:\.?[\w]{2,})+$

no list of TLD, localhost is not matched, the number of subparts must be >= 2 and the length of each subpart must be >= 2 (fx: "a.a" will not match but "a.ab" will match).

Upvotes: 12

Boldewyn

Reputation: 82814

$search  = "#^((?#
    the scheme:
  )(?:https?://)(?#
    second level domains and beyond:
  )(?:[\S]+\.)+((?#
    top level domains:
  )MUSEUM|TRAVEL|AERO|ARPA|ASIA|EDU|GOV|MIL|MOBI|(?#
  )COOP|INFO|NAME|BIZ|CAT|COM|INT|JOBS|NET|ORG|PRO|TEL|(?#
  )A[CDEFGILMNOQRSTUWXZ]|B[ABDEFGHIJLMNORSTVWYZ]|(?#
  )C[ACDFGHIKLMNORUVXYZ]|D[EJKMOZ]|(?#
  )E[CEGHRSTU]|F[IJKMOR]|G[ABDEFGHILMNPQRSTUWY]|(?#
  )H[KMNRTU]|I[DELMNOQRST]|J[EMOP]|(?#
  )K[EGHIMNPRWYZ]|L[ABCIKRSTUVY]|M[ACDEFGHKLMNOPQRSTUVWXYZ]|(?#
  )N[ACEFGILOPRUZ]|OM|P[AEFGHKLMNRSTWY]|QA|R[EOSUW]|(?#
  )S[ABCDEGHIJKLMNORTUVYZ]|T[CDFGHJKLMNOPRTVWZ]|(?#
  )U[AGKMSYZ]|V[ACEGINU]|W[FS]|Y[ETU]|Z[AMW])(?#
    the path, can be there or not:
  )(/[a-z0-9\._/~%\-\+&\#\?!=\(\)@]*)?)$#i";

Just cleaned up a bit. This will match only HTTP(s) addresses, and, as long as you copied all top level domains correctly from IANA, only those standardized (it will not match http://localhost) and with the http:// declared.

Finally you should end with the path part, that will always start with a /, if it is there.

However, I'd suggest to follow Cerebrus: If you're not sure about this, learn regexps in a more gentle way and use proven patterns for complicated tasks.

Cheers,

By the way: Your regexp will also match something.r and something.h (between |TO| and |TR| in your example). I left them out in my version, as I guess it was a typo.

On re-reading the question: Change

  )(?:https?://)(?#

  )(?:https?://)?(?#

(there is a ? extra) to match 'URLs' without the scheme.

Upvotes: 21

Matthieu

Reputation: 2761

$ : The dollar signifies the end of the string.
For example \d*$ will match strings which end with a digit. So you need to add the $!

Upvotes: 0

Bluehorn

Reputation: 3121

Using a single regexp to match an URL string makes the code incredible unreadable. I'd suggest to use parse_url to split the URL into its components (which is not a trivial task), and check each part with a regexp.

Upvotes: 1

Regex to match URL where scheme is optional (without &quot;http&quot;)

Answers (14)

Related Questions

Regex to match URL where scheme is optional (without "http")