Reputation: 7905
I am at the point where I am banging my head against my desk, to the amusement of my colleagues. I currently have the following regex
(^[\w](( \w+)|(\w*))*[\w]$)|(^\w$)
What I want it to do is match any string which contains only alphanumeric characters, no leading or trailing whitespace and no more than one space between words.
A word in this case is defined as one or more alphanumeric characters.
This matches most of what I want, however from testing it also thinks the second word onwards must be of 2 characters or more in length.
Tests:
ABC - Pass
Type 1 - Fail
Type A - Fail
Hello A - Fail
Hello Wo - Pass
H A B - Fail
H AB - Pass
AB H - Fail
Any ideas where I'm going wrong?
Upvotes: 4
Views: 27675
Reputation: 84353
First, we need to clean up your corpus, since the test lines contain dashes and pass/fail annotations. Next, we add a line that will definitely fail so we have a sad path for testing. This yields the following corpus:
# /tmp/corpus
ABC
Type 1
Type A
Hello A
Hello Wo
H A B
H AB
AB H
ab $ cd
Next, we use an anchored Perl-compatible regular expression with a POSIX class that only includes alphanumeric values. We use negative lookahead to prevent trailing spaces, but allow a single space between words.
$ pcregrep '^([[:alnum:]]+(?! $) ?)+$' /tmp/corpus
ABC
Type 1
Type A
Hello A
Hello Wo
H A B
H AB
AB H
This yields the 8 valid lines you were expecting. Success!
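If pcregrep is not at hand, the same check can be sketched in Python. This is only an approximation under a couple of assumptions: Python's re module has no POSIX classes, so [A-Za-z0-9] stands in for [[:alnum:]], and the negative lookahead is written as (?! $). The names PATTERN, corpus and valid_lines are made up for this illustration.

```python
import re

# Assumed translation of the pcregrep pattern: [A-Za-z0-9] replaces
# [[:alnum:]] because Python's re does not support POSIX classes.
# (?! $) rejects a word that is followed by a space at end of line.
PATTERN = re.compile(r"^(?:[A-Za-z0-9]+(?! $) ?)+$")

# Same lines as /tmp/corpus above, including the deliberate failure.
corpus = [
    "ABC",
    "Type 1",
    "Type A",
    "Hello A",
    "Hello Wo",
    "H A B",
    "H AB",
    "AB H",
    "ab $ cd",
]


def valid_lines(lines):
    """Return only the lines that the pattern accepts."""
    return [line for line in lines if PATTERN.match(line)]


if __name__ == "__main__":
    for line in valid_lines(corpus):
        print(line)
```

Run as a script, this should print the first eight lines of the corpus and skip the one containing the dollar sign. The nested quantifiers can backtrack heavily on long non-matching input, so treat it as a test aid rather than something to run on untrusted data.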
Upvotes: 2
Reputation: 1559
\w matches _ as well as alphanumerics. So if you don't want to match underscores, you'd have to use [a-zA-Z\d] instead.
The following expression should cover your needs:
^[a-zA-Z\d]+(?: [A-Za-z\d]{2,})*$
Alternatively you could use the following if {min,max} repetition is not supported.
^[A-Za-z\d]+(?: [A-Za-z\d][A-Za-z\d]+)*$
We need the {min,max} or double character group because of your requirement of minimum 2 characters from the second word onwards.
If underscores are allowed then the following expressions would be better:
^\w+(?: \w{2,})*$
or without {min,max}:
^\w+(?: \w\w+)*$
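To see the underscore difference between the two families of patterns, here is a rough Python sketch. Python and the re.ASCII flag are assumptions on my part (the question doesn't say which engine is in use); the flag keeps \w and \d limited to ASCII, which is not the default in Python 3. The names ALNUM_ONLY, WITH_UNDERSCORE and samples are invented for the example.

```python
import re

# The first and third patterns from this answer, compiled with re.ASCII
# so that \w means [A-Za-z0-9_] and \d means [0-9].
ALNUM_ONLY = re.compile(r"^[a-zA-Z\d]+(?: [A-Za-z\d]{2,})*$", re.ASCII)
WITH_UNDERSCORE = re.compile(r"^\w+(?: \w{2,})*$", re.ASCII)

samples = [
    "Hello Wo",     # plain alphanumerics, second word has two characters
    "foo_bar baz",  # contains an underscore
    "Hello A",      # second word has one character, so {2,} rejects it
]

for text in samples:
    alnum_ok = bool(ALNUM_ONLY.match(text))
    word_ok = bool(WITH_UNDERSCORE.match(text))
    print(f"{text!r}: alnum-only={alnum_ok}, with-underscore={word_ok}")
```

Only the second sample should differ between the two patterns, since the underscore is the one character that \w accepts and [a-zA-Z\d] does not.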
Upvotes: 0
Reputation: 5490
Your regex is close. The cause of your two-character problem is here:
(^[\w](( \w+)|(\w*))*[\w]$)|(^\w$)
        right here ---^
After matching the group ( \w+), i.e. a space followed by one or more \w, which every word after the first must match because of the space, you then have another mandatory \w. This requires the final word in the string to have two or more characters. Take that one out and it should be fine:
(^[\w](( \w+)|(\w*))*$)|(^\w$)
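A quick way to check this is to run both versions against the test strings from the question. The sketch below uses Python's re module, which is an assumption since the question doesn't name an engine; ORIGINAL, FIXED and TESTS are just names picked for the example.

```python
import re

# The regex from the question and the version with the extra [\w] removed.
ORIGINAL = re.compile(r"(^[\w](( \w+)|(\w*))*[\w]$)|(^\w$)")
FIXED = re.compile(r"(^[\w](( \w+)|(\w*))*$)|(^\w$)")

# Test strings taken from the question.
TESTS = [
    "ABC",
    "Type 1",
    "Type A",
    "Hello A",
    "Hello Wo",
    "H A B",
    "H AB",
    "AB H",
]

for text in TESTS:
    before = "Pass" if ORIGINAL.match(text) else "Fail"
    after = "Pass" if FIXED.match(text) else "Fail"
    print(f"{text!r}: original={before}, fixed={after}")
```

With the original pattern, only the strings whose last word has at least two characters should pass, which lines up with the results listed in the question. With the extra [\w] removed, all eight should pass.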
A simpler version would be:
^\w+( \w+)*$
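Here is a small sketch of the simpler version against a few edge cases, again assuming Python. One Python-specific detail: $ also matches just before a trailing newline, so the sketch uses re.fullmatch without the anchors rather than ^...$ with re.match. SIMPLE, is_valid and cases are hypothetical names for illustration.

```python
import re

# Same pattern as above without ^ and $, because fullmatch() already
# requires the whole string to match.
SIMPLE = re.compile(r"\w+( \w+)*")


def is_valid(text):
    """True if text is words separated by single spaces, nothing else."""
    return SIMPLE.fullmatch(text) is not None


cases = [
    "Hello A",       # single-character last word
    " Hello",        # leading space
    "Hello ",        # trailing space
    "Hello  World",  # two spaces between words
    "",              # empty string
    "Hello\n",       # trailing newline
    "a_b",           # underscore, which \w accepts
]

for text in cases:
    print(f"{text!r}: {is_valid(text)}")
```

Note that \w still accepts underscores, and in Python 3 it also accepts non-ASCII letters and digits unless re.ASCII is passed. If strictly ASCII alphanumerics are required, a class such as [A-Za-z0-9] in place of \w would be the safer choice.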
Upvotes: 9