Nick Loewen

Reputation: 95

Regex: how to match any string until whitespace, or until punctuation followed by whitespace?

I'm trying to write a regular expression which will find URLs in a plain-text string, so that I can wrap them with anchor tags. I know there are expressions already available for this, but I want to create my own, mostly because I want to know how it works.

Since it's not going to break anything if my regex fails, my plan is to write something fairly simple. So far that means: 1) match "www" or "http" at the start of a word 2) keep matching until the word ends.

I can do that, AFAICT. I have this: \b(http|www).?[^\s]+

Which works on foo www.example.com bar http://www.example.com etc.

The problem is that if I give it foo www.example.com, http://www.example.com it thinks that the comma is a part of the URL.
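A minimal way to reproduce this in PHP, assuming the pattern is wrapped in / delimiters (the variable names are only for illustration):

```php
<?php
// The pattern from above, unchanged apart from the added delimiters.
$pattern = '/\b(http|www).?[^\s]+/';
$input   = 'foo www.example.com, http://www.example.com';

preg_match_all($pattern, $input, $matches);

// $matches[0] holds the full matches; the first one keeps its trailing comma:
// "www.example.com," and "http://www.example.com"
print_r($matches[0]);
```

[^\s]+ is greedy and only stops at whitespace, so the comma is swallowed along with the rest of the word.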

So, if I am to use one expression to do this, I need to change "...and stop when you see whitespace" to "...and stop when you see whitespace or a piece of punctuation right before whitespace". This is what I'm not sure how to do.

At the moment, a solution I'm thinking of running with is just adding another test: matching the URL, and then on the next line removing any sneaky punctuation. This just isn't as elegant.

Note: I am writing this in PHP.

Aside: why does replacing \s with \b in the expression above not seem to work?


ETA:

Thanks everyone!

This is what I eventually ended up with, based on Explosion Pills's advice:

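function add_links( $string ) {
    // $arr[1] is the URL, $arr[2] any trailing punctuation, $arr[3] the whitespace or end of string.
    // A closure avoids redeclaring a named function every time add_links() is called.
    $replace = function ( $arr ) {
        if ( strncmp( "http", $arr[1], 4 ) == 0 ) {
            return "<a href=\"$arr[1]\">$arr[1]</a>$arr[2]$arr[3]";
        } else {
            // Prefix bare "www" matches so the link is absolute.
            return "<a href=\"http://$arr[1]\">$arr[1]</a>$arr[2]$arr[3]";
        }
    };
    return preg_replace_callback( '/\b((?:http|www).+?)((?!\/)[\p{P}]+)?(\s|$)/x', $replace, $string );
}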

I added a callback so that all of the links would start with http://, and did some fiddling with the way it handles punctuation.

It's probably not the Best way to do things, but it works. I've learned a lot about this in the last little while, but there is still more to learn!

Upvotes: 8

Views: 13029

Answers (4)

stema

Reputation: 93026

You can achieve this with a negative lookahead assertion:

\b(http:|www\.)(?:[^\s,.!?]|[,.!?](?!\s))+

See it here on Regexr.

This means: match any character except whitespace and ,.!? OR match one of ,.!? when it is not followed by whitespace.
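As a rough sketch of how this behaves in PHP (find_urls is a hypothetical helper name used only for this example; the pattern is the one above with / delimiters added):

```php
<?php
// Hypothetical helper: returns every substring matched by the pattern.
function find_urls(string $text): array
{
    $pattern = '/\b(http:|www\.)(?:[^\s,.!?]|[,.!?](?!\s))+/';
    preg_match_all($pattern, $text, $matches);
    return $matches[0]; // full matches only
}

// Punctuation followed by whitespace is left out of the match:
// "www.example.com" and "http://www.example.com"
print_r(find_urls('foo www.example.com, http://www.example.com. bar'));

// Punctuation at the very end of the string is NOT followed by whitespace,
// so the lookahead lets it through: "www.example.com."
print_r(find_urls('see www.example.com.'));
```

If trailing punctuation at the end of the input matters, the lookahead would presumably need to reject end of string as well, for example (?!\s|$).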

Aside: A word boundary is not a character or a set of characters, you can't put it into a character class. It is a zero width assertion, that is matching on a change from a word character to a non-word character. Here, I believe, \b in a character class is interpreted as the backspace character (the string escape sequence).
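A small check of that aside, assuming PHP's preg functions (PCRE); the sample strings are made up for illustration:

```php
<?php
// Inside a character class, \b stands for the backspace character (0x08).
$matchesBackspace = preg_match('/[\b]/', "\x08"); // 1: backspace is in the class
$matchesText      = preg_match('/[\b]/', 'a b');  // 0: no backspace in the subject

// So [^\b]+ means "one or more characters that are not backspace",
// which includes spaces, and the match runs on past the URL.
preg_match('/\b(http|www).?[^\b]+/', 'foo www.example.com, bar', $m);

var_dump($matchesBackspace, $matchesText, $m[0]); // $m[0] is "www.example.com, bar"
```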

Upvotes: 2

Ben

Reputation: 57277

The problem may lie in the dot, which means "any character" in regex-speak. You'll probably have to escape it:

\b(http|www)\.?[^\s]+

Then, the question mark means 0 or 1 so you've said "an optional dot" which is not what you want (right?):

\b(http|www)\.[^\s]+

Now a literal dot is required right after http or www (so a URL like http://example.com no longer matches). Next, say which characters you'll accept after it:

\b(http|www)\.[\w.\/-]+

or

\b(http|www)\.[a-zA-Z0-9_.\/-]+

So now you're saying,

  • at the boundary of a word
  • check for http or www
  • put a dot
  • allow letters, digits, underscores, dots, slashes and hyphens (so no whitespace)
  • one or more of those

Note - I haven't tested these but they are hopefully correct-ish.


Aside (my take on it) - the \s means 'whitespace'. The \b means 'word boundary'. The [] means 'an allowed character range'. The ^ means 'not'. The + means 'one or more'.

So when you say [^\b]+ you're not actually excluding word boundaries: inside [] the \b is read as the backspace character, so the class means 'any character except backspace, and there must be one or more'. That also matches whitespace, so the match doesn't stop at the end of the word.

Upvotes: 1

NeverHopeless

Reputation: 11233

You should try something like this:

\b(http|www).?[\w\.\/]+

Upvotes: 0

Explosion Pills

Reputation: 191779

$string = preg_replace('/
    \b       # Initial word boundary
    (        # Start capture
    (?:      # Non-capture group
    http|www # http or www (alternation)
    )        # end group
    .+?      # reluctant match for at least one character until...
    )        # End capture
    (        # Start capture
    [,.]+    # ...one or more of either a comma or period.
             # add more punctuation as needed
    )?       # End optional capture
    (\s|$)   # Followed by either a whitespace character or end of string
    /x', '<a href="\1">\1</a>\2\3', $string);

...is probably what you are going for. I think it's still imperfect, but it should at least work for your needs.
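To see what the replacement produces, here is a sketch that runs the same pattern in compact form on a sample string ($input and the other variable names are assumptions for the example):

```php
<?php
// Same pattern as above, written on one line without the /x comments.
$pattern     = '/\b((?:http|www).+?)([,.]+)?(\s|$)/';
$replacement = '<a href="\1">\1</a>\2\3';
$input       = 'foo www.example.com, http://www.example.com. bar';

$output = preg_replace($pattern, $replacement, $input);

// The trailing comma and period end up outside the anchor tags:
// foo <a href="www.example.com">www.example.com</a>, <a href="http://www.example.com">http://www.example.com</a>. bar
echo $output, "\n";
```

Note that the www link gets an href without a scheme, which is the gap the callback version in the question's ETA fills by prepending http://.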

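Aside: I think this is because a word boundary (\b) also occurs next to punctuation, such as between "www" and ".", so it can't stand in for "the end of the URL".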

Upvotes: 11
