HowTheF

Reputation: 81

Regex to split when uppercase after alphabetic lowercase char

I'm trying to split a string in Java using a regex and String.split. The regex should split wherever an uppercase letter directly follows a lowercase letter, like this:

hHere      // -> should split to ["h", "Here"]

I'm trying to split a string like this

String str = "1. Test split hHere and not .Here and /Here";
String[] splitString = str.split("(?=\\w+)((?=[^\\s])(?=\\p{Upper}))");
/* print splitString */
// -> should split to ["1. Test split h", "Here and not .Here and /Here"]
for (String s : splitString) {
    System.out.println(s);
}

Output I get:

1. 
Test split h
Here and not .
Here and /
Here

Output I want:

1. Test split h
Here and not .Here and /Here

I just can't figure out the regex to do this.

Upvotes: 0

Views: 70

Answers (2)

ctwheels

Reputation: 22837

As per my original comment.

Code

Option 1

This option matches only ASCII letters (a-z and A-Z), so it will not split at accented or non-Latin letters. It is sufficient for plain English text.

(?<=[a-z])(?=[A-Z])

Option 2

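This option uses Unicode character properties, so it works for any script that distinguishes lowercase from uppercase letters. In a Java string literal the backslashes must be doubled: "(?<=\\p{Ll})(?=\\p{Lu})".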

(?<=\p{Ll})(?=\p{Lu})

Explanation

Option 1

  • (?<=[a-z]) Positive lookbehind ensuring what precedes is a character in the set a-z (lowercase ASCII character)
  • (?=[A-Z]) Positive lookahead ensuring what follows is a character in the set A-Z (uppercase ASCII character)

Option 2

  • (?<=\p{Ll}) Positive lookbehind ensuring what precedes is a character in the Unicode general category Ll (lowercase letter)
  • (?=\p{Lu}) Positive lookahead ensuring what follows is a character in the Unicode general category Lu (uppercase letter)
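
For reference, here is a minimal Java sketch of both options applied to the string from the question. The class and method names (CaseBoundarySplit, splitAscii, splitUnicode) are made up for illustration and are not part of any library.

```java
class CaseBoundarySplit {
    // Option 1: zero-width boundary between an ASCII lowercase and an ASCII uppercase letter.
    static final String ASCII_BOUNDARY = "(?<=[a-z])(?=[A-Z])";

    // Option 2: same idea with Unicode categories; backslashes are doubled for the Java literal.
    static final String UNICODE_BOUNDARY = "(?<=\\p{Ll})(?=\\p{Lu})";

    static String[] splitAscii(String s) {
        return s.split(ASCII_BOUNDARY);
    }

    static String[] splitUnicode(String s) {
        return s.split(UNICODE_BOUNDARY);
    }

    public static void main(String[] args) {
        String str = "1. Test split hHere and not .Here and /Here";

        // Both options give the same result for this ASCII-only input:
        // 1. Test split h
        // Here and not .Here and /Here
        for (String part : splitUnicode(str)) {
            System.out.println(part);
        }
    }
}
```

".Here" and "/Here" stay intact because the lookbehind requires a lowercase letter immediately before the split point, and "." and "/" are not letters.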

Upvotes: 1

azro

Reputation: 54168

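You can use a simpler pattern: (?<=\p{Ll})(?=\p{Lu})

  • (?<= ) is a positive lookbehind: it asserts that the given subpattern matches text ending at the current position.
  • (?= ) is a positive lookahead: it asserts that the given subpattern matches starting at the current position.

  • Neither consumes any characters, which is essential here: split discards whatever the regex matches, so a zero-width match keeps every character in the output.

str.split("(?<=[a-z])(?=[A-Z])"); is the older version of this answer; it handles only ASCII letters and does not work for other alphabets.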
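
To see where the two patterns differ, here is a small sketch with a made-up input that contains accented letters. The class name UnicodeBoundaryDemo and the sample string are assumptions for illustration only; the letters are written as Unicode escapes so the source file encoding does not matter.

```java
import java.util.Arrays;

class UnicodeBoundaryDemo {
    // Older ASCII-only pattern.
    static String[] splitAscii(String s) {
        return s.split("(?<=[a-z])(?=[A-Z])");
    }

    // Unicode-aware pattern using the Ll and Lu general categories.
    static String[] splitUnicode(String s) {
        return s.split("(?<=\\p{Ll})(?=\\p{Lu})");
    }

    public static void main(String[] args) {
        // "cafe" with an accented final e, followed directly by "Ecole" with an accented capital E.
        String text = "caf\u00e9\u00c9cole";

        // The ASCII pattern finds no boundary, so the array has one element.
        System.out.println(Arrays.toString(splitAscii(text)));

        // The Unicode pattern splits between the two accented letters, giving two elements.
        System.out.println(Arrays.toString(splitUnicode(text)));
    }
}
```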

Upvotes: 2
