Reputation: 3935
I'm trying to extract URLs from a string. I have different posts that contain URLs in their messages. I've prepared a pattern to match them, but it's not working properly. I asked the same question here but forgot to include this case, so I'm asking a new question for it.
Pattern I tried:
\b(\.?)(?:https?://)?(?:(?i:[a-z]+\.)+)[^\s,]+\b
Code:
for ($i = 0; $i < $resultcount; $i++) {
    $pattern = '%\b(\.?)(?:https?://)?(?:(?i:[a-z]+\.)+)[^\s,]+\b%';
    $message = (string) $result[$i]['message'];
    preg_match_all($pattern, $message, $match);
    print_r($match);
}
An example of one of my posts:
"This is just a post to test regex for extracting URL. http://google.com, https://www.youtube.com/watch?v=dlw32af https://instagram.com/oscar/ en.wikipedia.org"
A post may or may not have commas between multiple URLs, and it is also possible that some text and a URL have no space in between, like:
sometext.http://google.com
Thank you people :)
Upvotes: 0
Views: 317
Reputation: 48751
This will match strings that are properly encoded and formatted like an HTTP URL, except those that fall into the IDN (internationalized domain name) category:
(?i)(?:https?://[^"'\s<>(){}]++|[a-z0-9](?<=\b.)[a-z0-9-]*+(?:\.[a-z-]{2,}+)++(?=[/?"'()\s]|:\d++|\Z)[^"'\s<>(){}]*+)
So you should not expect
ftp://username:password@ftpserver/folder/
to be matched.
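As a rough sketch of how this pattern behaves in PHP, the snippet below runs it over the sample post from the question. The function name extractHttpLikeUrls and the ~ delimiter are my own choices for illustration, not part of the answer. Note that the character class after https?:// does not exclude commas, so a comma that directly follows a URL ends up inside the match; the rtrim step is an assumed clean-up, not something the pattern does by itself.

```php
<?php
// Hypothetical helper: returns every substring matched by the pattern above.
function extractHttpLikeUrls(string $text): array
{
    // Same pattern as above. "~" is the delimiter, so "/" needs no escaping;
    // the single quotes inside the character classes are escaped for PHP.
    $pattern = '~(?i)(?:https?://[^"\'\s<>(){}]++|[a-z0-9](?<=\b.)[a-z0-9-]*+(?:\.[a-z-]{2,}+)++(?=[/?"\'()\s]|:\d++|\Z)[^"\'\s<>(){}]*+)~';
    preg_match_all($pattern, $text, $matches);
    return $matches[0];
}

$post = 'This is just a post to test regex for extracting URL. http://google.com, https://www.youtube.com/watch?v=dlw32af https://instagram.com/oscar/ en.wikipedia.org';

// Raw matches: the first one still carries the comma that follows it in the post.
$raw = extractHttpLikeUrls($post);

// Assumed clean-up step: strip a trailing comma from each match.
$urls = array_map(function ($u) {
    return rtrim($u, ',');
}, $raw);

print_r($urls);
```

The no-space case from the question (sometext.http://google.com) yields only http://google.com here, because the bare-domain branch cannot end in front of "://".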
Upvotes: 1
Reputation: 31001
In your initial question you failed to specify that each "word" (a part of the URL) can contain something other than letters.
Note that your regex contains [a-z], which suggests that you want to match only URLs whose "words" are composed entirely of letters, without any digits, minus chars or underscores.
Try the following regex:
(?:https?:\/\/)?(?i)[a-z][a-z0-9_-]*(?:[.\/](?!http)[a-z0-9_-]+)+\/?(?:\?[^\s,.]+)?
Description:
(?:https?:\/\/)? - Optional protocol name.
(?i) - Turn on the case-insensitive option.
[a-z][a-z0-9_-]* - The first "word" of the URL (first a letter, then any number of letters, digits, underscores or minus chars).
(?:[.\/] - Start of a non-capturing group: either a dot or a slash,
(?!http) - then a negative lookahead, to block cases where a URL starting with http is immediately preceded by a dot (or a slash),
[a-z0-9_-]+)+ - then the next "word" (no requirement to start with a letter). This whole non-capturing group is repeated one or more times.
\/? - Optional slash, terminating the part before the query string (if any).
(?:\?[^\s,.]+)? - Optional non-capturing group for the query string. It starts with ? followed by a sequence of chars other than whitespace, comma or dot.
The above regex does not match the trailing dot, just as you wish.
Note:
Because I tested this regex on regex101.com, I escaped the / chars contained in it. You can probably omit this escaping.
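For illustration, here is a minimal PHP sketch that applies this regex to the sample post and to the no-space case from the question. The function name extractUrls is hypothetical, and ~ is assumed as the pattern delimiter so that the slashes can stay unescaped.

```php
<?php
// Hypothetical helper: returns every substring matched by the regex from this answer.
function extractUrls(string $text): array
{
    // "~" is the delimiter, so "/" is written without a backslash.
    $pattern = '~(?:https?://)?(?i)[a-z][a-z0-9_-]*(?:[./](?!http)[a-z0-9_-]+)+/?(?:\?[^\s,.]+)?~';
    preg_match_all($pattern, $text, $matches);
    return $matches[0];
}

$post = 'This is just a post to test regex for extracting URL. http://google.com, https://www.youtube.com/watch?v=dlw32af https://instagram.com/oscar/ en.wikipedia.org';

// Four URLs, none of them carrying the comma or the sentence-ending dot.
print_r(extractUrls($post));

// Text glued to a URL with a dot: only the URL itself is returned.
print_r(extractUrls('sometext.http://google.com'));
```

Because the query-string part excludes dots, a query value that itself contains a dot would be cut off at that dot; whether that matters depends on the posts being processed.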
Following your comment, I changed the regex so that a "word" can also contain digits, underscores and minus chars.
Note also that - as the first or last char between [...] stands for itself (as opposed to - between two other chars, where it denotes a range, from - to).
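The two meanings of - inside [...] can be checked with a short PHP sketch. The test strings are made up for illustration.

```php
<?php
// "-" is the last char in the class, so it stands for a literal minus sign.
$literal = preg_match('~^[a-z0-9_-]+$~', 'my-site_01');

// "-" sits between "a" and "c", so the class means the range a..c
// and does not contain a literal minus sign.
$range = preg_match('~^[a-c]+$~', 'a-c');

// The same range class accepts a string made only of a, b and c.
$rangeOk = preg_match('~^[a-c]+$~', 'abc');

var_dump($literal, $range, $rangeOk);
```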
Upvotes: 0