arbales
arbales

Reputation: 5536

Detect URLs in text with JavaScript

Does anyone have suggestions for detecting URLs in a set of strings?

arrayOfStrings.forEach(function(string){
  // detect URLs in strings and do something swell,
  // like creating elements with links.
});

Update: I wound up using this regex for link detection… Apparently several years later.

kLINK_DETECTION_REGEX = /(([a-z]+:\/\/)?(([a-z0-9\-]+\.)+([a-z]{2}|aero|arpa|biz|com|coop|edu|gov|info|int|jobs|mil|museum|name|nato|net|org|pro|travel|local|internal))(:[0-9]{1,5})?(\/[a-z0-9_\-\.~]+)*(\/([a-z0-9_\-\.]*)(\?[a-z0-9+_\-\.%=&]*)?)?(#[a-zA-Z0-9!$&'()*+.=-_~:@/?]*)?)(\s+|$)/gi

The full helper (with optional Handlebars support) is at gist #1654670.

Upvotes: 232

Views: 313250

Answers (15)

Andrew Kang
Andrew Kang

Reputation: 328

You can use a regex like this to extract normal url patterns.

(https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}|www\.[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}|https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9]+\.[^\s]{2,}|www\.[a-zA-Z0-9]+\.[^\s]{2,})

If you need more sophisticated patterns, use a library like this.

https://github.com/Andrew-Kang-G/url-knife

Upvotes: 2

h0mayun
h0mayun

Reputation: 3621

Based on Crescent Fresh's answer

if you want to detect links with http:// OR without http:// and by www. you can use the following:

function urlify(text) {
    var urlRegex = /(((https?:\/\/)|(www\.))[^\s]+)/g;
    //var urlRegex = /(https?:\/\/[^\s]+)/g;
    return text.replace(urlRegex, function(url,b,c) {
        var url2 = (c == 'www.') ?  'http://' +url : url;
        return '<a href="' +url2+ '" target="_blank">' + url + '</a>';
    }) 
}

Upvotes: 33

Jean K&#225;ssio
Jean K&#225;ssio

Reputation: 11

There is a problem with other people's answers, for example, for those who want to get the text in an event to test if there are URLs (in the case of messaging applications, for example).

Example:

The regex presented here would return https:// only, or also just https://jeankassio

As this was my case, and I couldn't find satisfactory answers, I decided to create my Regex with my average knowledge on the subject, and I arrived at the following result.

/(http|https):\/\/([^.]+[\.][\S]+)/

Explaining the Regex:

He will get:

  • First the HTTP or HTTPS;
  • then characters before the first dot;
  • then capture the point;
  • only after capturing the point, after inserting some more characters will it be captured in the Regex.

This way, it makes it easier for programmers who want to use this Regex in real-time events.

OR ->

/(http|https):\/\/([^.]+[\.][\S]+(\s))/

This Regex will capture only after inserting a space after the link, which might be better for real time events

Upvotes: 1

Crescent Fresh
Crescent Fresh

Reputation: 117028

First you need a good regex that matches urls. This is hard to do. See here, here and here:

...almost anything is a valid URL. There are some punctuation rules for splitting it up. Absent any punctuation, you still have a valid URL.

Check the RFC carefully and see if you can construct an "invalid" URL. The rules are very flexible.

For example ::::: is a valid URL. The path is ":::::". A pretty stupid filename, but a valid filename.

Also, ///// is a valid URL. The netloc ("hostname") is "". The path is "///". Again, stupid. Also valid. This URL normalizes to "///" which is the equivalent.

Something like "bad://///worse/////" is perfectly valid. Dumb but valid.

Anyway, this answer is not meant to give you the best regex but rather a proof of how to do the string wrapping inside the text, with JavaScript.

OK so lets just use this one: /(https?:\/\/[^\s]+)/g

Again, this is a bad regex. It will have many false positives. However it's good enough for this example.

function urlify(text) {
  var urlRegex = /(https?:\/\/[^\s]+)/g;
  return text.replace(urlRegex, function(url) {
    return '<a href="' + url + '">' + url + '</a>';
  })
  // or alternatively
  // return text.replace(urlRegex, '<a href="$1">$1</a>')
}

var text = 'Find me at http://www.example.com and also at http://stackoverflow.com';
var html = urlify(text);

console.log(html)

// html now looks like:
// "Find me at <a href="http://www.example.com">http://www.example.com</a> and also at <a href="http://stackoverflow.com">http://stackoverflow.com</a>"

So in summary, you can try:

$('#pad dl dd').each(function(element) {
    element.innerHTML = urlify(element.innerHTML);
});

Upvotes: 319

djibril mugisho
djibril mugisho

Reputation: 11

Here is a little solution for react app without using any library please note that this method work if the url is not attached to any character

this component will return a paragraph with kink detection !

import React from "react";


interface Props {
    paragraph: string,
}

const REGEX = /^(http:\/\/www\.|https:\/\/www\.|http:\/\/|https:\/\/)?[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(:[0-9]{1,5})?(\/.*)?$/gm;

const Paragraph: React.FC<Props> = ({ paragraph }) => {
  
    const paragraphArray = paragraph.split(' ');
    return <div>

        {
            paragraphArray.map((word: any) => {
                return word.match(REGEX) ? (
                    <>
                        <a href={word} className="text-blue-400">{word}</a> {' '}
                    </>
                ) : word + ' '
            })
        }
    </div>;
};

export default LinkParaGraph;



Upvotes: 1

GMKHussain
GMKHussain

Reputation: 4719

Detect URLs in text and make clickable.

const detectURLInText = ( contentElement ) => {
  const elem = document.querySelector(contentElement);
      elem.innerHTML = elem.innerHTML.replace(/(https?:\/\/[^\s]+)/g, `<a class='link' href="$1">$1</a>`)
  return elem
}

detectURLInText( '#myContent');
<div id="myContent">
  Hell world!, detect URLs in text and make clickable.
  IP: https://123.0.1.890:8080  
  Web: https://any-domain.com
</div>

Upvotes: 4

eddyP23
eddyP23

Reputation: 6863

Generic Object Oriented Solution

For people like me that use frameworks like angular that don't allow manipulating DOM directly, I created a function that takes a string and returns an array of url/plainText objects that can be used to create any UI representation that you want.

URL regex

For URL matching I used (slightly adapted) h0mayun regex: /(?:(?:https?:\/\/)|(?:www\.))[^\s]+/g

My function also drops punctuation characters from the end of a URL like . and , that I believe more often will be actual punctuation than a legit URL ending (but it could be! This is not rigorous science as other answers explain well) For that I apply the following regex onto matched URLs /^(.+?)([.,?!'"]*)$/.

Typescript code

    export function urlMatcherInText(inputString: string): UrlMatcherResult[] {
        if (! inputString) return [];

        const results: UrlMatcherResult[] = [];

        function addText(text: string) {
            if (! text) return;

            const result = new UrlMatcherResult();
            result.type = 'text';
            result.value = text;
            results.push(result);
        }

        function addUrl(url: string) {
            if (! url) return;

            const result = new UrlMatcherResult();
            result.type = 'url';
            result.value = url;
            results.push(result);
        }

        const findUrlRegex = /(?:(?:https?:\/\/)|(?:www\.))[^\s]+/g;
        const cleanUrlRegex = /^(.+?)([.,?!'"]*)$/;

        let match: RegExpExecArray;
        let indexOfStartOfString = 0;

        do {
            match = findUrlRegex.exec(inputString);

            if (match) {
                const text = inputString.substr(indexOfStartOfString, match.index - indexOfStartOfString);
                addText(text);

                var dirtyUrl = match[0];
                var urlDirtyMatch = cleanUrlRegex.exec(dirtyUrl);
                addUrl(urlDirtyMatch[1]);
                addText(urlDirtyMatch[2]);

                indexOfStartOfString = match.index + dirtyUrl.length;
            }
        }
        while (match);

        const remainingText = inputString.substr(indexOfStartOfString, inputString.length - indexOfStartOfString);
        addText(remainingText);

        return results;
    }

    export class UrlMatcherResult {
        public type: 'url' | 'text'
        public value: string
    }

Upvotes: 1

Kashan Haider
Kashan Haider

Reputation: 1204

let str = 'https://example.com is a great site'
str.replace(/(https?:\/\/[^\s]+)/g,"<a href='$1' target='_blank' >$1</a>")

Short Code Big Work!...

Result:-

 <a href="https://example.com" target="_blank" > https://example.com </a>

Upvotes: 7

Vedmant
Vedmant

Reputation: 2591

There is existing npm package: url-regex, just install it with yarn add url-regex or npm install url-regex and use as following:

const urlRegex = require('url-regex');

const replaced = 'Find me at http://www.example.com and also at http://stackoverflow.com or at google.com'
  .replace(urlRegex({strict: false}), function(url) {
     return '<a href="' + url + '">' + url + '</a>';
  });

Upvotes: 4

kofifus
kofifus

Reputation: 19315

try this:

function isUrl(s) {
    if (!isUrl.rx_url) {
        // taken from https://gist.github.com/dperini/729294
        isUrl.rx_url=/^(?:(?:https?|ftp):\/\/)?(?:\S+(?::\S*)?@)?(?:(?!(?:10|127)(?:\.\d{1,3}){3})(?!(?:169\.254|192\.168)(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\u00a1-\uffff0-9]-*)*[a-z\u00a1-\uffff0-9]+)(?:\.(?:[a-z\u00a1-\uffff0-9]-*)*[a-z\u00a1-\uffff0-9]+)*(?:\.(?:[a-z\u00a1-\uffff]{2,}))\.?)(?::\d{2,5})?(?:[/?#]\S*)?$/i;
        // valid prefixes
        isUrl.prefixes=['http:\/\/', 'https:\/\/', 'ftp:\/\/', 'www.'];
        // taken from https://w3techs.com/technologies/overview/top_level_domain/all
        isUrl.domains=['com','ru','net','org','de','jp','uk','br','pl','in','it','fr','au','info','nl','ir','cn','es','cz','kr','ua','ca','eu','biz','za','gr','co','ro','se','tw','mx','vn','tr','ch','hu','at','be','dk','tv','me','ar','no','us','sk','xyz','fi','id','cl','by','nz','il','ie','pt','kz','io','my','lt','hk','cc','sg','edu','pk','su','bg','th','top','lv','hr','pe','club','rs','ae','az','si','ph','pro','ng','tk','ee','asia','mobi'];
    }

    if (!isUrl.rx_url.test(s)) return false;
    for (let i=0; i<isUrl.prefixes.length; i++) if (s.startsWith(isUrl.prefixes[i])) return true;
    for (let i=0; i<isUrl.domains.length; i++) if (s.endsWith('.'+isUrl.domains[i]) || s.includes('.'+isUrl.domains[i]+'\/') ||s.includes('.'+isUrl.domains[i]+'?')) return true;
    return false;
}

function isEmail(s) {
    if (!isEmail.rx_email) {
        // taken from http://stackoverflow.com/a/16016476/460084
        var sQtext = '[^\\x0d\\x22\\x5c\\x80-\\xff]';
        var sDtext = '[^\\x0d\\x5b-\\x5d\\x80-\\xff]';
        var sAtom = '[^\\x00-\\x20\\x22\\x28\\x29\\x2c\\x2e\\x3a-\\x3c\\x3e\\x40\\x5b-\\x5d\\x7f-\\xff]+';
        var sQuotedPair = '\\x5c[\\x00-\\x7f]';
        var sDomainLiteral = '\\x5b(' + sDtext + '|' + sQuotedPair + ')*\\x5d';
        var sQuotedString = '\\x22(' + sQtext + '|' + sQuotedPair + ')*\\x22';
        var sDomain_ref = sAtom;
        var sSubDomain = '(' + sDomain_ref + '|' + sDomainLiteral + ')';
        var sWord = '(' + sAtom + '|' + sQuotedString + ')';
        var sDomain = sSubDomain + '(\\x2e' + sSubDomain + ')*';
        var sLocalPart = sWord + '(\\x2e' + sWord + ')*';
        var sAddrSpec = sLocalPart + '\\x40' + sDomain; // complete RFC822 email address spec
        var sValidEmail = '^' + sAddrSpec + '$'; // as whole string

        isEmail.rx_email = new RegExp(sValidEmail);
    }

    return isEmail.rx_email.test(s);
}

will also recognize urls such as google.com , http://www.google.bla , http://google.bla , www.google.bla but not google.bla

Upvotes: 1

Dan Kantor
Dan Kantor

Reputation: 461

This library on NPM looks like it is pretty comprehensive https://www.npmjs.com/package/linkifyjs

Linkify is a small yet comprehensive JavaScript plugin for finding URLs in plain-text and converting them to HTML links. It works with all valid URLs and email addresses.

Upvotes: 31

niazangels
niazangels

Reputation: 2252

Here is what I ended up using as my regex:

var urlRegex =/(\b(https?|ftp|file):\/\/[-A-Z0-9+&@#\/%?=~_|!:,.;]*[-A-Z0-9+&@#\/%=~_|])/ig;

This doesn't include trailing punctuation in the URL. Crescent's function works like a charm :) so:

function linkify(text) {
    var urlRegex =/(\b(https?|ftp|file):\/\/[-A-Z0-9+&@#\/%?=~_|!:,.;]*[-A-Z0-9+&@#\/%=~_|])/ig;
    return text.replace(urlRegex, function(url) {
        return '<a href="' + url + '">' + url + '</a>';
    });
}

Upvotes: 206

tmp.innerText is undefined. You should use tmp.innerHTML

function strip(html) 
    {  
        var tmp = document.createElement("DIV"); 
        tmp.innerHTML = html; 
        var urlRegex =/(\b(https?|ftp|file):\/\/[-A-Z0-9+&@#\/%?=~_|!:,.;]*[-A-Z0-9+&@#\/%=~_|])/ig;   
        return tmp.innerHTML .replace(urlRegex, function(url) {     
        return '\n' + url 
    })

Upvotes: 0

Adam
Adam

Reputation: 12690

I googled this problem for quite a while, then it occurred to me that there is an Android method, android.text.util.Linkify, that utilizes some pretty robust regexes to accomplish this. Luckily, Android is open source.

They use a few different patterns for matching different types of urls. You can find them all here: http://grepcode.com/file/repository.grepcode.com/java/ext/com.google.android/android/2.0_r1/android/text/util/Regex.java#Regex.0WEB_URL_PATTERN

If you're just concerned about url's that match the WEB_URL_PATTERN, that is, urls that conform to the RFC 1738 spec, you can use this:

/((?:(http|https|Http|Https|rtsp|Rtsp):\/\/(?:(?:[a-zA-Z0-9\$\-\_\.\+\!\*\'\(\)\,\;\?\&\=]|(?:\%[a-fA-F0-9]{2})){1,64}(?:\:(?:[a-zA-Z0-9\$\-\_\.\+\!\*\'\(\)\,\;\?\&\=]|(?:\%[a-fA-F0-9]{2})){1,25})?\@)?)?((?:(?:[a-zA-Z0-9][a-zA-Z0-9\-]{0,64}\.)+(?:(?:aero|arpa|asia|a[cdefgilmnoqrstuwxz])|(?:biz|b[abdefghijmnorstvwyz])|(?:cat|com|coop|c[acdfghiklmnoruvxyz])|d[ejkmoz]|(?:edu|e[cegrstu])|f[ijkmor]|(?:gov|g[abdefghilmnpqrstuwy])|h[kmnrtu]|(?:info|int|i[delmnoqrst])|(?:jobs|j[emop])|k[eghimnrwyz]|l[abcikrstuvy]|(?:mil|mobi|museum|m[acdghklmnopqrstuvwxyz])|(?:name|net|n[acefgilopruz])|(?:org|om)|(?:pro|p[aefghklmnrstwy])|qa|r[eouw]|s[abcdeghijklmnortuvyz]|(?:tel|travel|t[cdfghjklmnoprtvwz])|u[agkmsyz]|v[aceginu]|w[fs]|y[etu]|z[amw]))|(?:(?:25[0-5]|2[0-4][0-9]|[0-1][0-9]{2}|[1-9][0-9]|[1-9])\.(?:25[0-5]|2[0-4][0-9]|[0-1][0-9]{2}|[1-9][0-9]|[1-9]|0)\.(?:25[0-5]|2[0-4][0-9]|[0-1][0-9]{2}|[1-9][0-9]|[1-9]|0)\.(?:25[0-5]|2[0-4][0-9]|[0-1][0-9]{2}|[1-9][0-9]|[0-9])))(?:\:\d{1,5})?)(\/(?:(?:[a-zA-Z0-9\;\/\?\:\@\&\=\#\~\-\.\+\!\*\'\(\)\,\_])|(?:\%[a-fA-F0-9]{2}))*)?(?:\b|$)/gi;

Here is the full text of the source:

"((?:(http|https|Http|Https|rtsp|Rtsp):\\/\\/(?:(?:[a-zA-Z0-9\\$\\-\\_\\.\\+\\!\\*\\'\\(\\)"
+ "\\,\\;\\?\\&\\=]|(?:\\%[a-fA-F0-9]{2})){1,64}(?:\\:(?:[a-zA-Z0-9\\$\\-\\_"
+ "\\.\\+\\!\\*\\'\\(\\)\\,\\;\\?\\&\\=]|(?:\\%[a-fA-F0-9]{2})){1,25})?\\@)?)?"
+ "((?:(?:[a-zA-Z0-9][a-zA-Z0-9\\-]{0,64}\\.)+"   // named host
+ "(?:"   // plus top level domain
+ "(?:aero|arpa|asia|a[cdefgilmnoqrstuwxz])"
+ "|(?:biz|b[abdefghijmnorstvwyz])"
+ "|(?:cat|com|coop|c[acdfghiklmnoruvxyz])"
+ "|d[ejkmoz]"
+ "|(?:edu|e[cegrstu])"
+ "|f[ijkmor]"
+ "|(?:gov|g[abdefghilmnpqrstuwy])"
+ "|h[kmnrtu]"
+ "|(?:info|int|i[delmnoqrst])"
+ "|(?:jobs|j[emop])"
+ "|k[eghimnrwyz]"
+ "|l[abcikrstuvy]"
+ "|(?:mil|mobi|museum|m[acdghklmnopqrstuvwxyz])"
+ "|(?:name|net|n[acefgilopruz])"
+ "|(?:org|om)"
+ "|(?:pro|p[aefghklmnrstwy])"
+ "|qa"
+ "|r[eouw]"
+ "|s[abcdeghijklmnortuvyz]"
+ "|(?:tel|travel|t[cdfghjklmnoprtvwz])"
+ "|u[agkmsyz]"
+ "|v[aceginu]"
+ "|w[fs]"
+ "|y[etu]"
+ "|z[amw]))"
+ "|(?:(?:25[0-5]|2[0-4]" // or ip address
+ "[0-9]|[0-1][0-9]{2}|[1-9][0-9]|[1-9])\\.(?:25[0-5]|2[0-4][0-9]"
+ "|[0-1][0-9]{2}|[1-9][0-9]|[1-9]|0)\\.(?:25[0-5]|2[0-4][0-9]|[0-1]"
+ "[0-9]{2}|[1-9][0-9]|[1-9]|0)\\.(?:25[0-5]|2[0-4][0-9]|[0-1][0-9]{2}"
+ "|[1-9][0-9]|[0-9])))"
+ "(?:\\:\\d{1,5})?)" // plus option port number
+ "(\\/(?:(?:[a-zA-Z0-9\\;\\/\\?\\:\\@\\&\\=\\#\\~"  // plus option query params
+ "\\-\\.\\+\\!\\*\\'\\(\\)\\,\\_])|(?:\\%[a-fA-F0-9]{2}))*)?"
+ "(?:\\b|$)";

If you want to be really fancy, you can test for email addresses as well. The regex for email addresses is:

/[a-zA-Z0-9\\+\\.\\_\\%\\-]{1,256}\\@[a-zA-Z0-9][a-zA-Z0-9\\-]{0,64}(\\.[a-zA-Z0-9][a-zA-Z0-9\\-]{0,25})+/gi

PS: The top level domains supported by above regex are current as of June 2007. For an up to date list you'll need to check https://data.iana.org/TLD/tlds-alpha-by-domain.txt.

Upvotes: 67

Gautam Sharma
Gautam Sharma

Reputation: 204

Function can be further improved to render images as well:

function renderHTML(text) { 
    var rawText = strip(text)
    var urlRegex =/(\b(https?|ftp|file):\/\/[-A-Z0-9+&@#\/%?=~_|!:,.;]*[-A-Z0-9+&@#\/%=~_|])/ig;   

    return rawText.replace(urlRegex, function(url) {   

    if ( ( url.indexOf(".jpg") > 0 ) || ( url.indexOf(".png") > 0 ) || ( url.indexOf(".gif") > 0 ) ) {
            return '<img src="' + url + '">' + '<br/>'
        } else {
            return '<a href="' + url + '">' + url + '</a>' + '<br/>'
        }
    }) 
} 

or for a thumbnail image that links to fiull size image:

return '<a href="' + url + '"><img style="width: 100px; border: 0px; -moz-border-radius: 5px; border-radius: 5px;" src="' + url + '">' + '</a>' + '<br/>'

And here is the strip() function that pre-processes the text string for uniformity by removing any existing html.

function strip(html) 
    {  
        var tmp = document.createElement("DIV"); 
        tmp.innerHTML = html; 
        var urlRegex =/(\b(https?|ftp|file):\/\/[-A-Z0-9+&@#\/%?=~_|!:,.;]*[-A-Z0-9+&@#\/%=~_|])/ig;   
        return tmp.innerText.replace(urlRegex, function(url) {     
        return '\n' + url 
    })
} 

Upvotes: 7

Related Questions