Mr. Pyramid

Reputation: 3935

Regex for extracting all URLs from a string, excluding a terminating period

I'm trying to extract URLs from a string. I have different posts that contain URLs in their message. I've prepared a pattern to match them, but it's not working properly. I asked the same question here before but forgot to include this case, so I'm asking a new question for it.

Tried Pattern

\b(\.?)(?:https?://)?(?:(?i:[a-z]+\.)+)[^\s,]+\b

CODE

// The pattern never changes, so it is defined once, outside the loop.
$pattern = '%\b(\.?)(?:https?://)?(?:(?i:[a-z]+\.)+)[^\s,]+\b%';
for ($i = 0; $i < $resultcount; $i++) {
    $message = (string) $result[$i]['message'];
    preg_match_all($pattern, $message, $match);
    print_r($match);
}

An example of my post looks like this:

"This is just a post to test regex for extracting URL. http://google.com, https://www.youtube.com/watch?v=dlw32af https://instagram.com/oscar/ en.wikipedia.org"

A post may or may not have commas between multiple URLs, and it is also possible that there is no space between a string and a URL, like this:

sometext.http://google.com


Thank you people :)

Upvotes: 0

Views: 317

Answers (2)

revo

Reputation: 48751

This matches properly encoded strings that are formatted like an HTTP URL, except those that fall under IDN (internationalized domain names):

(?i)(?:https?://[^"'\s<>(){}]++|[a-z0-9](?<=\b.)[a-z0-9-]*+(?:\.[a-z-]{2,}+)++(?=[/?"'()\s]|:\d++|\Z)[^"'\s<>(){}]*+)

So do not expect

ftp://username:password@ftpserver/folder/ 

to be matched.
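
As a rough illustration, the pattern above could be applied in PHP as follows. This is only a sketch: the function name extractUrlsPossessive is hypothetical, and the rtrim step is an added assumption, not part of the pattern itself. It is there because the https?:// branch does not exclude commas or periods, so "http://google.com," would otherwise be returned with its trailing comma.

```php
<?php
// Hypothetical helper (not part of the answer): applies the pattern above
// and strips a trailing "." or "," from each match.
function extractUrlsPossessive(string $text): array
{
    // A nowdoc keeps the quotes and backslashes of the pattern untouched.
    // "~" is used as the delimiter because it does not occur in the pattern.
    $pattern = <<<'REGEX'
~(?i)(?:https?://[^"'\s<>(){}]++|[a-z0-9](?<=\b.)[a-z0-9-]*+(?:\.[a-z-]{2,}+)++(?=[/?"'()\s]|:\d++|\Z)[^"'\s<>(){}]*+)~
REGEX;

    preg_match_all($pattern, $text, $matches);

    // Assumption: sentence punctuation at the very end is not part of the URL.
    return array_map(function ($url) {
        return rtrim($url, '.,');
    }, $matches[0]);
}

$post = 'This is just a post to test regex for extracting URL. http://google.com, '
      . 'https://www.youtube.com/watch?v=dlw32af https://instagram.com/oscar/ en.wikipedia.org';

print_r(extractUrlsPossessive($post));
print_r(extractUrlsPossessive('sometext.http://google.com'));
```

With the sample post from the question this should yield the four URLs, and the ftp:// string above should produce no match at all.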


Upvotes: 1

Valdi_Bo

Reputation: 31001

In your initial question you did not specify that each "word" (a part of the URL) can contain something other than letters. Note that your regex contains [a-z], which suggests that you want to match only URLs whose "words" are composed entirely of letters, without any digits, minus chars or underscores.

Try the following regex:

(?:https?:\/\/)?(?i)[a-z][a-z0-9_-]*(?:[.\/](?!http)[a-z0-9_-]+)+\/?(?:\?[^\s,.]+)?

Description:

  • (?:https?:\/\/)? - Optional protocol name.
  • (?i) - Turn on case insensitive option.
  • [a-z][a-z0-9_-]* - The first "word" of the URL (first a letter, then any number of letters, digits, underscores or minus chars).
  • (?:[.\/] - Non-capturing group: Either a dot or a slash.
  • (?!http) - then a negative lookahead, to block cases where a URL starting with http is immediately preceded by a dot (or a slash).
  • [a-z0-9_-]+)+ - then the next "word" (no requirement to start with a letter); this whole non-capturing group is repeated one or more times.
  • \/? - Optional slash, terminating the part before query string (if any).
  • (?:\?[^\s,.]+)? - Optional non-capturing group for the query string. It starts with ? followed by a sequence of chars other than whitespace, comma or dot.

The above regex does not match the trailing dot, just as you wish.
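
A minimal sketch of how this regex might be used from PHP with preg_match_all, as in the question. The function name extractUrls is hypothetical and only there for illustration. The pattern itself is unchanged, and % is kept as the delimiter from the question's code.

```php
<?php
// Hypothetical helper name; the pattern is the one from this answer, unchanged.
function extractUrls(string $text): array
{
    // "%" is the delimiter, so the escaped "\/" is harmless but not required.
    $pattern = '%(?:https?:\/\/)?(?i)[a-z][a-z0-9_-]*(?:[.\/](?!http)[a-z0-9_-]+)+\/?(?:\?[^\s,.]+)?%';

    preg_match_all($pattern, $text, $matches);

    // The pattern has no capturing groups, so only the full matches are of interest.
    return $matches[0];
}

$post = 'This is just a post to test regex for extracting URL. http://google.com, '
      . 'https://www.youtube.com/watch?v=dlw32af https://instagram.com/oscar/ en.wikipedia.org';

print_r(extractUrls($post));
print_r(extractUrls('sometext.http://google.com'));
print_r(extractUrls('Visit en.wikipedia.org.'));
```

For the sample post this should return the four URLs without the comma after google.com, and for the last call the final period should be left out of the match.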

Note:

Because I tested this regex on regex101.com, I escaped the / chars contained in it. You can probably omit this escaping.
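
To illustrate the point about escaping /, here is a small sketch with a deliberately simplified pattern (https?:// followed by non-whitespace, chosen only for this demonstration). In PHP only the delimiter character needs escaping, so switching the delimiter from / to % lets the slashes stand as they are.

```php
<?php
$text = 'see https://instagram.com/oscar/ for details';

// "/" as delimiter: every literal slash in the pattern must be escaped.
preg_match('/https?:\/\/\S+/', $text, $escaped);

// "%" as delimiter: slashes can be written as-is.
preg_match('%https?://\S+%', $text, $plain);

// Both variants find the same URL.
var_dump($escaped[0] === $plain[0]);
```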

Following your comment, I changed the regex so that a "word" can also contain digits, underscores and minus chars.

Note also that - as the first or last char between [...] stands for itself (as opposed to - between two other chars, where it denotes a range, from - to).
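
The behaviour of - inside [...] can be checked with a short sketch like the one below. The test strings are made up for illustration.

```php
<?php
// "-" placed last in the class is a literal hyphen.
$literalHyphen = preg_match('/^[a-z0-9_-]+$/', 'my-site_01');

// "-" between two chars denotes a range: [a-c] means a, b or c ...
$rangeMatches = preg_match('/^[a-c]+$/', 'abc');

// ... so a literal hyphen is not accepted by [a-c].
$rangeRejectsHyphen = preg_match('/^[a-c]+$/', 'a-c');

var_dump($literalHyphen, $rangeMatches, $rangeRejectsHyphen);
```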

Upvotes: 0
