Reputation: 1681

Regex split TitleCase Word

My regex doesn't work for splitting a TitleCase word in PHP: I want to separate the author name from the article text that is glued to it. Articles without an author should not be affected by the regex.

My current regex: From (\S+\s){2}(?<=[a-z])(?=[A-Z])

Input:

From Günther RossmannThis is the article From Harry Gregson-WilliamsAnother article text From Nora WaldstättenSome lorem ipsum stuff From the fantastic architect of the year Text without an author

Expected output:

From Günther Rossmann This is the article From Harry Gregson-Williams Another article text From Nora Waldstätten Some lorem ipsum stuff From the fantastic architect of the year Text without an author

Upvotes: 2

Answers (3)

Wiktor Stribiżew

Reputation: 627609

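With the {2} quantifier, your pattern expands to \S+\s\S+\s, so it requires a whitespace character right after the second word. In your input there is no whitespace between the lowercase and the uppercase letter, and the lookbehind (?<=[a-z]) can never succeed immediately after \s anyway.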
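
To see the failure concretely, here is a minimal sketch that runs the pattern from the question against the first sample article. The variable names are only illustrative.

```php
<?php
// First article from the question: the author name is glued to the text.
$input = 'From Günther RossmannThis is the article';

// Pattern from the question. Every repetition of (\S+\s) ends with a
// whitespace character, so the lookbehind (?<=[a-z]) that follows it
// always sees whitespace and fails.
$pattern = '~From (\S+\s){2}(?<=[a-z])(?=[A-Z])~';

$found = preg_match($pattern, $input);
var_dump($found); // int(0): no match
```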

You may use

'~From\s+(\S+\s\S+)(?![^\p{Lu}])~u'

Details

From - a literal substring
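\s+ - 1+ whitespace characters
(\S+\s\S+) - Group 1: 1+ non-whitespace chars, 1 whitespace char, and again 1+ non-whitespace chars
(?![^\p{Lu}]) - a negative lookahead that fails if the next char is anything other than an uppercase letter, so the match must be followed by an uppercase letter or the end of the string.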
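
A minimal sketch of how this pattern could be used to get the expected output, assuming the goal is to insert a single space after each author name. `$0` in the replacement stands for the whole match; the variable names are illustrative.

```php
<?php
// Sample input from the question, as one line.
$input = 'From Günther RossmannThis is the article '
       . 'From Harry Gregson-WilliamsAnother article text '
       . 'From Nora WaldstättenSome lorem ipsum stuff '
       . 'From the fantastic architect of the year Text without an author';

// \S+ is greedy and first swallows "RossmannThis"; the lookahead forces it
// to backtrack until the next character is an uppercase letter.
$pattern = '~From\s+(\S+\s\S+)(?![^\p{Lu}])~u';

// Re-emit the whole match followed by a space.
$output = preg_replace($pattern, '$0 ', $input);

echo $output, PHP_EOL;
```

The "From the fantastic architect" article is left alone because no uppercase letter follows any prefix of its second word.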

Or, use a more specific one:

'~From\s+(\p{Lu}\p{Ll}*\s+\p{Lu}\p{Ll}*)~u'

Or, to also support apostrophes or hyphens:

From\h+(\p{Lu}\p{Ll}*(?:[\h'-]\p{Lu}\p{Ll}*)*)

Here, \p{Lu} matches an uppercase letter and \p{Ll}* matches 0+ lowercase letters.

Note that for easier access, you may even get rid of the capturing group and use the \K operator, which omits the text matched so far from the match value:

'~From\h+\K\p{Lu}\p{Ll}*(?:[\h\'-]\p{Lu}\p{Ll}*)*~u'

Note that you should use the u modifier when working with Unicode property classes like \p{Lu} and with Unicode strings.
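
As a sketch of the \K variant, the following collects just the author names with preg_match_all. Two details here are my own adjustments rather than part of the patterns as originally posted: the hyphen is placed last in the character class (PCRE2, used by PHP 7.3+, rejects a hyphen directly after \h as an invalid range), and the apostrophe is escaped so it can sit inside a single-quoted PHP string.

```php
<?php
// Sample input from the question, as one line.
$input = 'From Günther RossmannThis is the article '
       . 'From Harry Gregson-WilliamsAnother article text '
       . 'From Nora WaldstättenSome lorem ipsum stuff '
       . 'From the fantastic architect of the year Text without an author';

// \K discards "From " from the reported match, so the full match is the name.
// [\h\'-] allows a horizontal space, apostrophe or hyphen between name parts.
$pattern = '~From\h+\K\p{Lu}\p{Ll}*(?:[\h\'-]\p{Lu}\p{Ll}*)*~u';

preg_match_all($pattern, $input, $matches);
$names = $matches[0];

print_r($names);
// Günther Rossmann, Harry Gregson-Williams, Nora Waldstätten
```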

Upvotes: 2

anubhava

Reputation: 786349

You may use this regex to match title case author names preceded by From:

\bFrom(?:[\h-]+\p{Lu}\p{Ll}*)+

RegEx Breakup:

\bFrom: Match From preceded by a word boundary
(?: Start of a non-capturing group
- [\h-]+: Match 1+ horizontal whitespace characters or hyphens
- \p{Lu}: Match 1 uppercase Unicode letter
- \p{Ll}*: Match 0 or more lowercase Unicode letters
)+: End of the non-capturing group; match 1 or more repetitions of this group
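
A minimal sketch of using this pattern from PHP. It assumes the ~ delimiters and the u modifier (needed for \p{..} on UTF-8 input) and, as one possible use, inserts a space after each matched name.

```php
<?php
// Sample input from the question, as one line.
$input = 'From Günther RossmannThis is the article '
       . 'From Harry Gregson-WilliamsAnother article text '
       . 'From Nora WaldstättenSome lorem ipsum stuff '
       . 'From the fantastic architect of the year Text without an author';

// Each repetition of the group consumes one title-case name part, so the
// match stops right before "This", "Another" and "Some".
$pattern = '~\bFrom(?:[\h-]+\p{Lu}\p{Ll}*)+~u';

$output = preg_replace($pattern, '$0 ', $input);

echo $output, PHP_EOL;
```

"From the fantastic ..." is not touched, because the word after From does not start with an uppercase letter.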

Upvotes: 1

ctwheels

Reputation: 22837

Code

(From \S+\h+\S+(?<=\p{Ll})(?=\p{Lu}))

Results

Input

From Günther RossmannThis is the article
From Harry Gregson-WilliamsAnother article text
From Nora WaldstättenSome lorem ipsum stuff
From the fantastic architect of the year
Text without an author

Output

<b>From Günther Rossmann</b>This is the article
<b>From Harry Gregson-Williams</b>Another article text
<b>From Nora Waldstätten</b>Some lorem ipsum stuff
From the fantastic architect of the year
Text without an author

Explanation

(From \S+\h+\S+(?<=\p{Ll})(?=\p{Lu})) Capture the following into capture group 1
- From Match this literally
- \S+ Match any non-whitespace character one or more times
- \h+ Match any horizontal whitespace character one or more times
- \S+ Match any non-whitespace character one or more times
- (?<=\p{Ll}) Positive lookbehind ensuring what precedes is a lowercase character in any language/script (Unicode)
- (?=\p{Lu}) Positive lookahead ensuring what follows is an uppercase character in any language/script (Unicode)

I use \p{} Unicode property classes so that letters from any script are matched, since two of your names contain non-ASCII letters (ü, ä).
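
The results above can be reproduced with a short PHP sketch along these lines. The ~ delimiters, the u modifier and the `<b>$1</b>` replacement string are assumptions chosen to match the output shown.

```php
<?php
// One article per line, as in the Input block above.
$input = implode("\n", [
    'From Günther RossmannThis is the article',
    'From Harry Gregson-WilliamsAnother article text',
    'From Nora WaldstättenSome lorem ipsum stuff',
    'From the fantastic architect of the year',
    'Text without an author',
]);

// The second \S+ backtracks until it ends on a lowercase letter that is
// directly followed by an uppercase one.
$pattern = '~(From \S+\h+\S+(?<=\p{Ll})(?=\p{Lu}))~u';

// Wrap capture group 1 in <b> tags.
$output = preg_replace($pattern, '<b>$1</b>', $input);

echo $output, PHP_EOL;
```

To get the plain split the question asks for, the replacement string could be `'$1 '` instead.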