BillPull

Reputation: 7013

JavaScript Regex to match a URL in a field of text

How can I set up my regex to test whether a URL is contained in a block of text in JavaScript? I can't quite figure out the pattern to use to accomplish this.

 var urlpattern = new RegExp( "(http|ftp|https):\/\/[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?" );

 var txtfield = $('#msg').val() /*this is a textarea*/

 if ( urlpattern.test(txtfield) ){
        //do something about it
 }

EDIT:

The pattern I have now works in regex testers for what I need it to do, but Chrome throws an error:

  "Invalid regular expression: /(http|ftp|https)://[w-_]+(.[w-_]+)+([w-.,@?^=%&:/~+#]*[w-@?^=%&/~+#])?/: Range out of order in character class"

for the following code:

var urlexp = new RegExp( '(http|ftp|https):\/\/[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?' );

Upvotes: 27

Views: 94861

Answers (8)

Dan Levy

Reputation: 1263

Goal: Extract & Parse all URIs found in an input string

2025 Note: patterns updated for a write-up article on my site (added in case anyone wants to learn more about my techniques for building and testing a 100+ character regex).

Updated on 2024/08, 2021/06, and 2020/11!

Note: This isn't meant to be RFC compliant; NOT meant for validation!

Parsing must isolate protocol, domain, path, query and hash.

2024-12-20 simpler, may include trailing punctuation (114 chars, 👨‍🍳)

([-.a-z0-9]+:\/{1,3})([^-/\.[\](|)\s?][^`\/\s\]?]+)([-_a-z0-9!@$%^&*()=+;/~\.]*)[?]?([^#\s`?]*)[#]?([^#\s'"`\.]*)

2024-12-29 very accurate, uses look-aheads and look-behinds (157 chars, requires regex lookahead and lookbehind support)

([-\.\w]+:\/{2,3})(?!.*[.]{2})(?![-.*\.])((?!.*@\.)[-_\w@^=%&:;~+\.]+(?<![-\.]))(\/[-_\w@^=%&$:;/~+\.]+(?<!\.))?[?]?([-_\w=&@$!|~+]+)*[#]?([-_\w=&@$!|~+]+)*

Example JS code with output - every URL is turned into a 5-part array of its 'parts' (protocol, host, path, query, and hash)

var re = /([-\.\w]+:\/{2,3})(?!.*[.]{2})(?![-.*\.])((?!.*@\.)[-_\w@^=%&:;~+\.]+(?<![-\.]))(\/[-_\w@^=%&$:;/~+\.]+(?<!\.))?[?]?([-_\w=&@$!|~+]+)*[#]?([-_\w=&@$!|~+]+)*/gi;
var str = 'Bob: Hey there, have you checked https://www.facebook.com ?\n(ignore) https://github.com/justsml?tab=activity#top (ignore this too)';
var m;

while ((m = re.exec(str)) !== null) {
    if (m.index === re.lastIndex) {
        re.lastIndex++;
    }
    console.log(m);
}

Will give you the following:

["https://www.facebook.com",
  "https://",
  "www.facebook.com",
  "",
  "",
  ""
]

["https://github.com/justsml?tab=activity#top",
  "https://",
  "github.com",
  "/justsml",
  "tab=activity",
  "top"
]
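As a possible alternative to the exec loop, here is a minimal sketch that uses String.prototype.matchAll and maps each match onto a plain object. The helper name parseUrls and the property names are assumptions made for illustration, not part of the pattern itself. Optional groups that do not participate in a match come back as undefined, so the sketch defaults them to empty strings.

```javascript
// Same pattern as above; it needs an engine with lookbehind support.
const urlParts = /([-\.\w]+:\/{2,3})(?!.*[.]{2})(?![-.*\.])((?!.*@\.)[-_\w@^=%&:;~+\.]+(?<![-\.]))(\/[-_\w@^=%&$:;/~+\.]+(?<!\.))?[?]?([-_\w=&@$!|~+]+)*[#]?([-_\w=&@$!|~+]+)*/gi;

// parseUrls is a hypothetical helper name, chosen for this example only.
function parseUrls(text) {
  // matchAll requires the g flag and yields one match array per URL found.
  return Array.from(text.matchAll(urlParts), (m) => ({
    url: m[0],
    protocol: m[1],
    host: m[2],
    path: m[3] ?? '',  // groups that did not participate are undefined
    query: m[4] ?? '',
    hash: m[5] ?? '',
  }));
}

const parsed = parseUrls(
  'Bob: Hey there, have you checked https://www.facebook.com ?\n' +
  '(ignore) https://github.com/justsml?tab=activity#top (ignore this too)'
);
console.log(parsed);
```

Returning objects instead of positional arrays keeps calling code readable (parsed[1].query rather than m[4]).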

Upvotes: 6

Tolga İskender

Reputation: 188

Try this; it worked for me:

/^((ftp|http[s]?):\/\/)?(www\.)([a-z0-9]+)\.[a-z]{2,5}(\.[a-z]{2})?$/

It is simple and easy to understand.
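A quick sketch of what this pattern accepts, using made-up sample strings. Because the pattern is anchored with ^ and $ and requires www., it checks whether an entire string is a URL rather than whether a URL appears somewhere inside a block of text.

```javascript
const wwwUrl = /^((ftp|http[s]?):\/\/)?(www\.)([a-z0-9]+)\.[a-z]{2,5}(\.[a-z]{2})?$/;

// Sample inputs are assumptions chosen for illustration.
const samples = [
  'http://www.example.com',         // full URL with protocol
  'www.example.co.uk',              // protocol is optional, two-part TLD is allowed
  'example.com',                    // rejected: "www." is required
  'see http://www.example.com now', // rejected: ^ and $ anchor to the whole string
];

const accepted = samples.map((s) => wwwUrl.test(s));
console.log(accepted);
```

For the textarea case in the question, the anchors would have to be dropped before the pattern could find a URL in the middle of other text.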

Upvotes: 0

Code Jockey

Reputation: 6721

Though escaping the dash characters (which can have a special meaning as character range specifiers when inside a character class) should work, one other method for taking away their special meaning is putting them at the beginning or the end of the class definition.

In addition, \+ and \@ in a character class are indeed interpreted as + and @ respectively by the JavaScript engine; however, the escapes are not necessary and may confuse someone trying to interpret the regex visually.
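The two points above can be checked with a short sketch. The sample strings and variable names are assumptions made for illustration; the character classes are the ones under discussion.

```javascript
// Three ways to include a literal dash in a character class.
const dashEscaped = /^[\w\-_]+$/; // escaped in the middle
const dashFirst = /^[-\w_]+$/;    // placed first
const dashLast = /^[\w_-]+$/;     // placed last

const host = 'my-host_name';
const dashResults = [dashEscaped, dashFirst, dashLast].map((re) => re.test(host));
console.log(dashResults);

// Escaping + and @ inside a class is allowed but changes nothing.
const withEscapes = /^[\+\@]+$/;
const withoutEscapes = /^[+@]+$/;
const symbolResults = [withEscapes.test('+@'), withoutEscapes.test('+@')];
console.log(symbolResults);

// A string containing a character outside the class is still rejected.
const rejected = dashLast.test('my host');
console.log(rejected);
```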

I would recommend the following regex for your purposes:

(http|ftp|https)://[\w-]+(\.[\w-]+)+([\w.,@?^=%&amp;:/~+#-]*[\w@?^=%&amp;/~+#-])?

this can be specified in JavaScript either by passing it into the RegExp constructor (like you did in your example):

var urlPattern = new RegExp("(http|ftp|https)://[\\w-]+(\\.[\\w-]+)+([\\w.,@?^=%&amp;:/~+#-]*[\\w@?^=%&amp;/~+#-])?")

or by directly specifying a regex literal, using the // quoting method:

var urlPattern = /(http|ftp|https):\/\/[\w-]+(\.[\w-]+)+([\w.,@?^=%&amp;:\/~+#-]*[\w@?^=%&amp;\/~+#-])?/

The RegExp constructor is necessary if you accept a regex as a string (from user input or an AJAX call, for instance), and might be more readable (as it is in this case). I am fairly certain that the // quoting method is more efficient, and is at certain times more readable. Both work.
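A minimal sketch of the two forms side by side, using a shortened version of the pattern (scheme and host only) to keep it readable. The sample text is an assumption. Note that in the string form each backslash is doubled so that it survives string parsing, while in the literal form each / is escaped instead.

```javascript
// Regex literal: forward slashes must be escaped, backslashes are written once.
const fromLiteral = /(http|ftp|https):\/\/[\w-]+(\.[\w-]+)+/;

// RegExp constructor: slashes are plain, backslashes are doubled inside the string.
const fromString = new RegExp('(http|ftp|https)://[\\w-]+(\\.[\\w-]+)+');

const message = 'Docs live at http://example.com/start for now.';
const literalHit = fromLiteral.test(message);
const stringHit = fromString.test(message);
const noLink = fromLiteral.test('no links here');

console.log(literalHit, stringHit, noLink);
```

Neither regex uses the g flag, so test() keeps no state between calls and can safely be reused on different inputs.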

I tested your original and this modification using Chrome both on JSFiddle and on RegexLib.com, using the Client-Side regex engine (browser) and specifically selecting JavaScript. While the first one fails with the error you stated, my suggested modification succeeds. If I remove the h from the http in the source, it fails to match, as it should!

Edit

As noted by @noa in the comments, the expression above will not match local network (non-internet) servers or any other servers accessed with a single word (e.g. http://localhost/... or https://sharepoint-test-server/...). If matching this type of url is desired (which it may or may not be), the following might be more appropriate:

(http|ftp|https)://[\w-]+(\.[\w-]+)*([\w.,@?^=%&amp;:/~+#-]*[\w@?^=%&amp;/~+#-])?

#------changed----here-------------^

<End Edit>
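The effect of changing + to * can be seen in a small sketch. It uses only the scheme and host portion of the two expressions, and the example.com URL is an assumption added for comparison.

```javascript
// (\.[\w-]+)+ requires at least one ".label" after the first host label.
const requiresDot = /(http|ftp|https):\/\/[\w-]+(\.[\w-]+)+/;
// (\.[\w-]+)* allows zero or more, so single-word hosts are accepted.
const allowsSingleWord = /(http|ftp|https):\/\/[\w-]+(\.[\w-]+)*/;

const urls = [
  'http://localhost/app',
  'https://sharepoint-test-server/',
  'http://example.com/',
];

const comparison = urls.map((url) => ({
  url,
  plus: requiresDot.test(url),
  star: allowsSingleWord.test(url),
}));
console.log(comparison);
```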

Finally, an excellent resource that taught me 90% of what I know about regex is Regular-Expressions.info - I highly recommend it if you want to learn regex (both what it can do and what it can't)!

Upvotes: 75

Khadijah J Shtayat

Reputation: 200

Try this general regex, which handles many URL formats:

/(([A-Za-z]{3,9}):\/\/)?([-;:&=\+\$,\w]+@{1})?(([-A-Za-z0-9]+\.)+[A-Za-z]{2,3})(:\d+)?((\/[-\+~%\/\.\w]+)?\/?([&?][-\+=&;%@\.\w]+)?(#[\w]+)?)?/g

Upvotes: 1

Toto

Reputation: 91430

You have to escape the backslash when you are using new RegExp.

Also, you can put the dash - at the end of a character class to avoid escaping it.

&amp; inside a character class means & or a or m or p or ; , so you only need to put & and ; , because a, m and p are already matched by \w.

So, your regex becomes:

var urlexp = new RegExp( '(http|ftp|https)://[\\w-]+(\\.[\\w-]+)+([\\w-.,@?^=%&:/~+#-]*[\\w@?^=%&;/~+#-])?' );
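To see why the backslashes have to be doubled, here is a small sketch that looks at what the regex engine actually receives. It uses only the first character class of the original pattern, and the variable names are assumptions made for illustration.

```javascript
// In a JavaScript string, \w and \- are not recognised escapes, so the backslash is dropped.
const singleBackslash = '[\w\-_]+';
// Doubling the backslash keeps it in the string that RegExp receives.
const doubleBackslash = '[\\w\\-_]+';

console.log(singleBackslash); // what the engine sees without doubling
console.log(doubleBackslash); // what the engine sees with doubling

// Without the backslashes the class contains the reversed range w-_ and fails to compile.
let compileError = null;
try {
  new RegExp(singleBackslash);
} catch (e) {
  compileError = e;
}
console.log(compileError instanceof SyntaxError);

// With doubled backslashes the pattern compiles and matches as intended.
const compiled = new RegExp('^' + doubleBackslash + '$');
const matchesHost = compiled.test('my-host_1');
console.log(matchesHost);
```

This is the same situation as the error message in the question, where the pattern is echoed back with every backslash missing.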

Upvotes: 2

matthiasmullie

Reputation: 2083

I've cleaned up your regex:

var urlexp = new RegExp('(http|ftp|https)://[a-z0-9\\-_]+(\\.[a-z0-9\\-_]+)+([a-z0-9\\-\\.,@\\?^=%&;:/~\\+#]*[a-z0-9\\-@\\?^=%&;/~\\+#])?', 'i');

Tested and works just fine ;)

Upvotes: 1

PotatoEngineer

Reputation: 1642

The trouble is that the "-" in the character class (the brackets) is being parsed as a range: [a-z] means "any character between a and z." As Vini-T suggested, you need to escape the "-" characters in the character classes, using a backslash.

Upvotes: 0

Vinit

Reputation: 1825

Try this: (http|ftp|https):\/\/[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&amp;:/~\+#]*[\w\-\@?^=%&amp;/~\+#])?

Upvotes: 1
