Reputation: 1681
My regex doesn't really work for splitting a TitleCase word in PHP. Articles without an author should not be affected by the regex.
My current regex: From (\S+\s){2}(?<=[a-z])(?=[A-Z])
Input:
From Günther RossmannThis is the article
From Harry Gregson-WilliamsAnother article text
From Nora WaldstättenSome lorem ipsum stuff
From the fantastic architect of the year
Text without an author
Expected output:
<b>From Günther Rossmann</b> This is the article
<b>From Harry Gregson-Williams</b> Another article text
<b>From Nora Waldstätten</b> Some lorem ipsum stuff
From the fantastic architect of the year
Text without an author
Upvotes: 2
Views: 112
Reputation: 626861
With the {2} quantifier, your pattern expands to \S+\s\S+\s, but there is no whitespace between the lowercase and the uppercase letter.
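To see the failure concretely, the original pattern can be run against one of the sample lines. This is only a minimal sketch: the ~ delimiters, the u flag and the variable names are my additions, and the source file is assumed to be saved as UTF-8.

```php
<?php
// Original pattern from the question, wrapped in ~ delimiters (assumption).
$original = '~From (\S+\s){2}(?<=[a-z])(?=[A-Z])~u';
$line = 'From Günther RossmannThis is the article';

// (\S+\s){2} consumes "Günther " and then "RossmannThis ".
// Each repetition has to end in whitespace, so the lookbehind (?<=[a-z])
// always sees a space instead of a lowercase letter and the match fails.
$matched = preg_match($original, $line);
var_dump($matched); // int(0)
```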
You may use
'~From\s+(\S+\s\S+)(?![^\p{Lu}])~u'
Details
From - a literal substring
\s+ - 1+ whitespaces
(\S+\s\S+) - Group 1: one or more non-whitespace chars, 1 whitespace and again 1+ non-whitespace chars
(?![^\p{Lu}]) - followed by an uppercase letter or end of string.
Or, use a more specific one:
'~From\s+(\p{Lu}\p{Ll}*\s+\p{Lu}\p{Ll}*)~u'
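A short sketch of what this more specific pattern captures on the sample lines may help. The $lines and $names variables are illustrative, and the file is assumed to be UTF-8 encoded.

```php
<?php
$pattern = '~From\s+(\p{Lu}\p{Ll}*\s+\p{Lu}\p{Ll}*)~u';
$lines = [
    'From Nora WaldstättenSome lorem ipsum stuff',
    'From Harry Gregson-WilliamsAnother article text',
    'From the fantastic architect of the year',
];

$names = [];
foreach ($lines as $line) {
    // Group 1 holds the name when an author is found; null otherwise.
    $names[] = preg_match($pattern, $line, $m) ? $m[1] : null;
}
print_r($names);
// The second entry is "Harry Gregson": the hyphen is not a lowercase
// letter, so the match stops before "-Williams".
```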
Or, to also support apostrophes or hyphens:
From\h+(\p{Lu}\p{Ll}*(?:[\h'-]\p{Lu}\p{Ll}*)*)
Here, \p{Lu} matches an uppercase letter, \p{Ll}* matches 0+ lowercase letters.
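As a rough illustration, the hyphen/apostrophe variant could be plugged into preg_replace to produce the bold markup from the question. In this sketch the hyphen is placed last in the character class and the apostrophe is escaped for the single-quoted PHP string; both are my adjustments, as are the variable names and the trailing space in the replacement.

```php
<?php
// Hyphen last in the class so it is a literal; apostrophe escaped because
// the pattern lives in a single-quoted PHP string.
$pattern = '~From\h+(\p{Lu}\p{Ll}*(?:[\h\'-]\p{Lu}\p{Ll}*)*)~u';

$input = "From Günther RossmannThis is the article\n"
       . "From Harry Gregson-WilliamsAnother article text\n"
       . "From the fantastic architect of the year\n"
       . "Text without an author";

// $0 is the whole match ("From" plus the name).
$output = preg_replace($pattern, '<b>$0</b> ', $input);
echo $output, "\n";
```

Lines without a title-case name after From are left untouched because \p{Lu} cannot match the lowercase "t" in "the".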
Note that for easier access, you may even get rid of the capturing group and use the \K operator, which omits the text matched so far from the match value:
'~From\h+\K\p{Lu}\p{Ll}*(?:[\h\'-]\p{Lu}\p{Ll}*)*~u'
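A minimal sketch of the \K variant used with preg_match_all, assuming the goal is to extract only the names. The sample input and variable names are illustrative, and the apostrophe is escaped so the pattern fits in a single-quoted string.

```php
<?php
$pattern = '~From\h+\K\p{Lu}\p{Ll}*(?:[\h\'-]\p{Lu}\p{Ll}*)*~u';

$input = "From Nora WaldstättenSome lorem ipsum stuff\n"
       . "From Harry Gregson-WilliamsAnother article text\n"
       . "Text without an author";

preg_match_all($pattern, $input, $matches);

// $matches[0] holds the names only; "From " is dropped by \K.
print_r($matches[0]);
```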
Note that you should use the u modifier when using Unicode property classes like \p{Lu} and Unicode strings.
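The effect of the u modifier can be checked with a small comparison. This sketch assumes PHP 7 or later for the "\u{FC}" escape, and the variable names are made up for the example.

```php
<?php
// "\u{FC}" is "ü"; PHP stores it as the two UTF-8 bytes 0xC3 0xBC.
$name = "G\u{FC}nther";

// With u, the subject is read as UTF-8, so "ü" is one lowercase letter.
$withU = preg_match('~^\p{Lu}\p{Ll}*$~u', $name);

// Without u, each byte is tested on its own and neither byte of "ü"
// counts as a lowercase letter, so the whole-name match fails.
$withoutU = preg_match('~^\p{Lu}\p{Ll}*$~', $name);

var_dump($withU);    // int(1)
var_dump($withoutU); // int(0)
```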
Upvotes: 2
Reputation: 785186
You may use this regex to match title case author names preceded by From:
\bFrom(?:[\h-]+\p{Lu}\p{Ll}*)+
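One way this pattern might be applied in PHP is sketched below. The boldAuthor helper is hypothetical, and the ~ delimiters, the u flag and the trailing space in the replacement are assumptions on my part rather than part of the answer.

```php
<?php
// Hypothetical helper: wraps a "From <Name>" match in <b> tags.
function boldAuthor(string $line): string
{
    $pattern = '~\bFrom(?:[\h-]+\p{Lu}\p{Ll}*)+~u';
    // $0 is the full match, i.e. "From" plus every title-case name part.
    return preg_replace($pattern, '<b>$0</b> ', $line);
}

echo boldAuthor('From Harry Gregson-WilliamsAnother article text'), "\n";
echo boldAuthor('From the fantastic architect of the year'), "\n";
```

The second line comes back unchanged, since the group needs at least one word that starts with an uppercase letter.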
RegEx Breakup:
\bFrom : Match From with word boundary
(?: : Start non-capturing group
[\h-]+ : Match 1+ horizontal space or hyphen
\p{Lu} : Match 1 uppercase Unicode letter
\p{Ll}* : Match 0 or more lowercase Unicode letters
)+ : End non-capturing group. Match 1 or more of this group
Upvotes: 1
Reputation: 22817
(From \S+\h+\S+(?<=\p{Ll})(?=\p{Lu}))
Input:
From Günther RossmannThis is the article
From Harry Gregson-WilliamsAnother article text
From Nora WaldstättenSome lorem ipsum stuff
From the fantastic architect of the year
Text without an author
Output:
<b>From Günther Rossmann</b>This is the article
<b>From Harry Gregson-Williams</b>Another article text
<b>From Nora Waldstätten</b>Some lorem ipsum stuff
From the fantastic architect of the year
Text without an author
(From \S+\h+\S+(?<=\p{Ll})(?=\p{Lu})) - Capture the following into capture group 1
  From - Match this literally
  \S+ - Match any non-whitespace character one or more times
  \h+ - Match any horizontal whitespace character one or more times
  \S+ - Match any non-whitespace character one or more times
  (?<=\p{Ll}) - Positive lookbehind ensuring what precedes is a lowercase character in any language/script (Unicode)
  (?=\p{Lu}) - Positive lookahead ensuring what follows is an uppercase character in any language/script (Unicode)
I use \p{} character classes to ensure any script is matched, since you have two names with Unicode symbols in them.
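For completeness, here is a sketch of how the pattern might be used with preg_replace to get the output shown above. The ~ delimiters, the u flag, the replacement string and the variable names are assumptions, not part of the original pattern.

```php
<?php
$pattern = '~(From \S+\h+\S+(?<=\p{Ll})(?=\p{Lu}))~u';

$input = "From Günther RossmannThis is the article\n"
       . "From Nora WaldstättenSome lorem ipsum stuff\n"
       . "From the fantastic architect of the year";

// $1 is capture group 1: "From", the first name and the last name.
$output = preg_replace($pattern, '<b>$1</b>', $input);
echo $output, "\n";
```

The last \S+ first grabs "RossmannThis" and then backtracks until a lowercase letter is directly followed by an uppercase one, which is where the name ends.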
Upvotes: 1