Jon Taylor

Reputation: 7905

Regex for multiple words split by spaces

I am at the point where I am banging my head against my desk, to the amusement of my colleagues. I currently have the following regex

(^[\w](( \w+)|(\w*))*[\w]$)|(^\w$)

What I want it to do is match any string which contains only alphanumeric characters, no leading or trailing whitespace and no more than one space between words.

A word in this case is defined as one or more alphanumeric characters.

This matches most of what I want; however, testing shows that it also requires every word from the second onwards to be at least 2 characters long.

Tests:

ABC - Pass
Type 1 - Fail
Type A - Fail
Hello A - Fail
Hello Wo - Pass
H A B - Fail
H AB - Pass
AB H - Fail

Any ideas where I'm going wrong?

Upvotes: 4

Views: 27675

Answers (3)

Todd A. Jacobs

Reputation: 84353

Use PCRE with POSIX Class

First, we need to clean up your corpus, since its lines contain dashes and Pass/Fail annotations. Next, we add a line that will definitely fail so we have a sad path for testing. This yields the following corpus:

# /tmp/corpus
ABC
Type 1
Type A
Hello A
Hello Wo
H A B
H AB
AB H
ab $ cd

Next, we use an anchored Perl-compatible regular expression with a POSIX class that only includes alphanumeric values. We use negative lookahead to prevent trailing spaces, but allow a single space between words.

$ pcregrep '^([[:alnum:]]+(?! $) ?)+$' /tmp/corpus
ABC
Type 1
Type A
Hello A
Hello Wo
H A B
H AB
AB H

As expected, this prints the 8 valid lines and leaves out the `ab $ cd` line. Success!
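For anyone without pcregrep to hand, here is a rough Python sketch of the same idea. Two assumptions: Python's `re` module has no POSIX classes, so `[A-Za-z0-9]` stands in for `[[:alnum:]]`, and the negative lookahead is spelled `(?! $)` (spelling it `(?!= $)` would instead forbid a literal `=` followed by a trailing space). The `is_valid` helper and the extra negative samples are made up for illustration.

```python
import re

# [A-Za-z0-9] is a stand-in for [[:alnum:]], which Python's re does not support.
# (?! $) rejects a word that is followed by a single space and then end of string.
ALNUM_WORDS = re.compile(r"^([A-Za-z0-9]+(?! $) ?)+$")


def is_valid(line):
    """Return True if the line is alphanumeric words separated by single spaces."""
    return ALNUM_WORDS.search(line) is not None


# A few lines from the corpus plus two hypothetical whitespace failures.
samples = ["ABC", "Type 1", "H A B", "ab $ cd", "ABC ", "A  B"]
for line in samples:
    print(repr(line), is_valid(line))
```

The last three samples are the sad path: a non-alphanumeric character, a trailing space, and a double space.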

Upvotes: 2

rvalvik

Reputation: 1559

\w matches _ as well as alphanumerics. So if you don't want to match underscores, you'd have to use [a-zA-Z\d] instead.
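A quick way to see the difference, sketched in Python. The `only` helper is hypothetical, and `re.ASCII` is an assumption made so that `\w` and `\d` behave like the ASCII-only classes discussed here (by default Python 3 also matches Unicode letters and digits).

```python
import re


def only(pattern, text):
    """Return True if the whole text consists of characters allowed by pattern."""
    # re.ASCII limits \w to [a-zA-Z0-9_] and \d to [0-9].
    return re.fullmatch(pattern, text, flags=re.ASCII) is not None


print(only(r"\w+", "snake_case"))          # underscore is accepted by \w
print(only(r"[a-zA-Z\d]+", "snake_case"))  # underscore is rejected here
print(only(r"[a-zA-Z\d]+", "snake9case"))  # letters and digits are still fine
```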

The following expression should cover your needs:

^[A-Za-z\d]+(?: [A-Za-z\d]{2,})*$

Alternatively you could use the following if {min,max} repetition is not supported.

^[A-Za-z\d]+(?: [A-Za-z\d][A-Za-z\d]+)*$

We need the {min,max} or double character group because of your requirement of minimum 2 characters from the second word onwards.

If underscores are allowed then the following expressions would be better:

^\w+(?: \w{2,})*$

or without {min,max}:

^\w+(?: \w\w+)*$

Upvotes: 0

Your regex is close. The cause of your two-character problem is here:

(^[\w](( \w+)|(\w*))*[\w]$)|(^\w$)
       right here ---^

After matching the group ( \w+), i.e. a space followed by one or more \w, which every word after the first must match because of the space, you then have another mandatory \w. This requires the final word in the string to have two or more characters. Take that one out and it should be fine:

(^[\w](( \w+)|(\w*))*$)|(^\w$)
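If you want to check this yourself, here is a small Python sketch that runs both patterns over the test strings from the question. The patterns are copied verbatim; the `matches` helper and the `SAMPLES` list are just illustrative names.

```python
import re

# Both patterns are copied verbatim from the question and from the fix above.
ORIGINAL = re.compile(r"(^[\w](( \w+)|(\w*))*[\w]$)|(^\w$)")
FIXED = re.compile(r"(^[\w](( \w+)|(\w*))*$)|(^\w$)")

# Test strings from the question, without the Pass/Fail annotations.
SAMPLES = ["ABC", "Type 1", "Type A", "Hello A", "Hello Wo", "H A B", "H AB", "AB H"]


def matches(pattern, text):
    """Return True if the compiled pattern matches the text."""
    return pattern.search(text) is not None


for text in SAMPLES:
    print(f"{text!r:12} original={matches(ORIGINAL, text)} fixed={matches(FIXED, text)}")
```

With the original pattern, only the strings whose last word has two or more characters should match; with the extra \w removed, all eight should.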

A simpler version would be:

^\w+( \w+)*$
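As a sanity check, a minimal Python sketch of the simpler pattern. It uses `re.fullmatch`, so the `^` and `$` anchors are implied (an assumption made here because Python's `$` would otherwise also match just before a trailing newline). The `is_valid` name is made up for the example.

```python
import re

# Same pattern as above, minus the anchors: fullmatch already requires
# the whole string to match.
WORDS = re.compile(r"\w+( \w+)*")


def is_valid(text):
    """Return True for words separated by exactly one space, no edge whitespace."""
    return WORDS.fullmatch(text) is not None


for text in ["Type 1", "H A B", "AB H", " ABC", "ABC ", "A  B", ""]:
    print(repr(text), is_valid(text))
```

The first three should be accepted; the leading space, trailing space, double space and empty string should all be rejected.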

Upvotes: 9
