Run

Reputation: 57196

PHP preg_match with regex: only single hyphens and spaces between words continue

I am trying to write a regex that allows single hyphens and single spaces only between words, not at the beginning or the end.

I thought I had this sorted out from the answer I got yesterday, but I just realised there is a small error which I don't quite understand.

Why won't it accept inputs like these:

'forum-category-b forum-category-a'
'forum-category-b Counter-terrorism'
'forum-category-a Preventing'
'forum-category-a Preventing Violent'
'forum-category-a International-Research-and-Publications'
'International-Research-and-Publications forum-category-b forum-category-a'

but it accepts these:

'forum-category-b'
'Counter-terrorism forum-category-a'
'Preventing forum-category-a'
'Preventing Violent forum-category-a'
'International-Research-and-Publications forum-category-b'

Why is that? How can I fix it? Below is the regex with the initial tests. Ideally it should accept all of the input combinations above.

$aWords = array(
    'a',
    '---stack---over---flow---',
    '   stack    over    flow',
    'stack-over-flow',
    'stack over flow',
    'stacoverflow'
);

foreach($aWords as $sWord) {
    if (preg_match('/^(\w+([\s-]\w+)?)+$/', $sWord)) {
        echo 'pass: ' . $sWord . "\n";
    } else {
        echo 'fail: ' . $sWord . "\n";
    }
}

It should also reject inputs like these:

---stack---over---flow---
stack-over-flow- stack-over-flow2
   stack    over    flow

Thanks.

Upvotes: 1

Views: 4588

Answers (3)

Ferdinand Beyer

Reputation: 67157

Your pattern does not do what you want. Let's break it apart:

^(\w+([\s-]\w+)?)+$

It matches strings that consist solely of one or more sequences of the pattern:

\w+([\s-]\w+)?

...which is a sequence of word characters, followed optionally by one other sequence of word characters, separated by one space or dash character.

In other words, your pattern searches for strings like:

xxx-xxxyyy-yyyzzz zzz

...but you intend to write a pattern that would find:

xxx-xxxxxx-xxxxxx yyy

In your examples, this one is matched:

Counter-terrorism forum-category-a

...but it is interpreted as the following sequence:

(Counter(-terroris)) (m( foru)) (m(-categor)) (y(-a))

As you can see, the pattern did not really find the words you are looking for.

This example is not matched:

forum-category-a Preventing Violent

...since the pattern cannot form groups of "word characters, space-or-dash, word-characters" when it encounters a single word character followed by space or dash:

(forum(-categor)) (y(-a)) <Mismatch: Found " " but expected "\w">

If you added another character to "forum-category-a", say "forum-category-ax", it would match again, since the pattern could split at the "ax":

(forum(-categor)) (y(-a)) (x( Preventin)) (g( Violent))
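
The grouping described above can be checked with a small script. This is a minimal sketch in Python, assuming its `re` module treats these ASCII inputs the same way PHP's `preg_match` does; the variable names are illustrative only. For a repeated group, `group(1)` holds only the last repetition, which makes the odd split visible.

```python
import re

# The pattern from the question, unchanged.
original = re.compile(r'^(\w+([\s-]\w+)?)+$')

matched = original.match('Counter-terrorism forum-category-a')
# A repeated group keeps only its last repetition: here "y-a", not a whole word.
last_group = matched.group(1)
print(last_group)

# The single-character word "a" sits between two separators,
# so no grouping of the input works and the match fails.
rejected = original.match('forum-category-a Preventing Violent')
print(rejected)

# One extra character ("ax") gives the engine a place to split again.
accepted = original.match('forum-category-ax Preventing Violent')
print(accepted is not None)
```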

What you are actually interested in is a pattern like

^(\w+(-\w+)*)(\s\w+(-\w+)*)*$

...which would find a sequence of words that may contain dashes, separated by spaces:

(forum(-category)(-a)) ( Preventing) ( Violent)
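
As a quick check, the suggested pattern can be run against the inputs from the question. The sketch below uses Python rather than PHP, on the assumption that both engines behave the same for these ASCII strings; `should_pass` and `should_fail` are illustrative names, and the same pattern should be usable in `preg_match` with `/` delimiters.

```python
import re

# Words may contain single dashes; words are separated by one whitespace character.
fixed = re.compile(r'^(\w+(-\w+)*)(\s\w+(-\w+)*)*$')

should_pass = [
    'a',
    'stack-over-flow',
    'stack over flow',
    'forum-category-a Preventing Violent',
    'International-Research-and-Publications forum-category-b forum-category-a',
]
should_fail = [
    '---stack---over---flow---',
    'stack-over-flow- stack-over-flow2',
    '   stack    over    flow',
]

passed = [s for s in should_pass if fixed.match(s)]
failed = [s for s in should_fail if not fixed.match(s)]

for s in should_pass + should_fail:
    print('pass' if fixed.match(s) else 'fail', repr(s))
```

Note that `\s` also matches tabs and newlines; if only a literal space should be allowed, a plain space in the pattern would be the stricter choice.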

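By the way, I tested this using a Python script, and while trying to match your pattern against the example string "International-Research-and-Publications forum-category-b forum-category-a", the regular expression engine appeared to hang. This is most likely catastrophic backtracking rather than a true infinite loop: the string cannot match, and the nested quantifiers give the engine a huge number of ways to split each word before it gives up.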

import re
expr = re.compile(r'^(\w+([\s-]\w+)?)+$')
expr.match('International-Research-and-Publications forum-category-b forum-category-a')
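
To see how the running time of the question's pattern grows on non-matching input, the sketch below times it on short strings. The input shape (`'a' * n + '-'`) is made up for illustration, and the timings will differ between machines.

```python
import re
import time

original = re.compile(r'^(\w+([\s-]\w+)?)+$')

results = []
for n in (10, 12, 14, 16, 18):
    # The trailing dash guarantees a mismatch, forcing the engine to backtrack.
    text = 'a' * n + '-'
    start = time.perf_counter()
    result = original.match(text)
    elapsed = time.perf_counter() - start
    results.append(result)
    print(f'n={n:2d}  match={result}  seconds={elapsed:.4f}')
```

If the engine explores every way of splitting the run of word characters, the work roughly doubles with each extra character, which would explain why the much longer string above never seems to finish.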

Upvotes: 2

user557597

Reputation:

There should be only one answer to this problem:

/^((?<=\w)[ -]\w|[^ -])+$/

There is only one rule as stated, \w[ -]\w, and that's it. It works at per-character granularity and cannot be anything else. Add the [^ -] for the rest.
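
A minimal sketch of how this per-character pattern behaves, written in Python on the assumption that its `re` module handles the fixed-width lookbehind the same way PCRE does for these inputs. The test strings are taken from the question, plus one made-up string (`'a!b'`) to show that `[^ -]` admits any character other than space and hyphen, not just word characters.

```python
import re

# Each step consumes either a separator that sits between two word characters,
# or any single character that is not a space or a hyphen.
per_char = re.compile(r'^((?<=\w)[ -]\w|[^ -])+$')

checks = {
    'forum-category-a Preventing Violent': True,
    'stack-over-flow- stack-over-flow2': False,
    '---stack---over---flow---': False,
    '   stack    over    flow': False,
    'a!b': True,  # hypothetical input: punctuation is accepted by [^ -]
}

outcomes = {text: per_char.match(text) is not None for text in checks}
for text, ok in outcomes.items():
    print('pass' if ok else 'fail', repr(text))
```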

Upvotes: 0

Brad Christie

Reputation: 101614

The ([\s-]\w+)? part of your pattern is the issue. It allows at most one repetition (the trailing ?). Try changing that last ? to * and see if that helps.

Nope, I still believe that's the problem. The original pattern is looking for "word" or "word[space_hyphen]word" repeated 1+ times. Which is weird because the pattern should fall within another match. But switching the question mark worked for me.
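
To make the suggested change concrete, here is a hedged sketch in Python (assuming it behaves like `preg_match` for these ASCII strings). It tries the pattern with the last `?` changed to `*`, and also a variant without the outer `(...)+`. That second variant is an additional simplification, not taken from the answer above: once the inner group can repeat, the outer repetition adds nothing to what is matched.

```python
import re

# The question's pattern with the trailing ? changed to *.
starred = re.compile(r'^(\w+([\s-]\w+)*)+$')
# Assumption: the outer (...)+ is redundant once the inner group repeats.
simplified = re.compile(r'^\w+([\s-]\w+)*$')

inputs = [
    'forum-category-a Preventing Violent',
    'forum-category-b forum-category-a',
    'stack-over-flow- stack-over-flow2',
    '---stack---over---flow---',
    '   stack    over    flow',
]

starred_results = [starred.match(s) is not None for s in inputs]
simplified_results = [simplified.match(s) is not None for s in inputs]

for s, ok in zip(inputs, starred_results):
    print('pass' if ok else 'fail', repr(s))
```

Both variants accept separators only between words. Dropping the nested repetition should also reduce the backtracking on strings that fail to match.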

Upvotes: 0
