Reputation: 2878
Assuming these string definitions:
String lowerStream = "flüßchen";
String upperStream = "FLÜSSCHEN";
String streamPattern = ".*(ss).*";
Using this pattern:
Pattern pattern = Pattern.compile(streamPattern, Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
...this assertion passes:
assertThat( pattern.matcher(upperStream).find() ).isTrue();
...and this one fails:
assertThat( pattern.matcher(lowerStream).find() ).isTrue();
...whereas both lowerStream and upperStream pass on rubular.com with each of these regexes:
/.*(ss).*/i
/.*(SS).*/i
/.*(ß).*/i
It is also not possible to get a successful comparison using any of String.equalsIgnoreCase(), String.toLowerCase().equals(), or String.toUpperCase().equals().
Does Java's Unicode regex support only simple case folding? If so, why is this not explicitly documented?
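For anyone who wants to reproduce this without a test framework, here is a minimal sketch. The class name CaseFoldingRepro and the helper found() are made up for illustration, plain booleans stand in for the assertThat calls, and the strings are written with Unicode escapes (precomposed ü and Ü assumed) so the result does not depend on the source file encoding.

```java
import java.util.regex.Pattern;

class CaseFoldingRepro {
    // Same pattern and flags as in the question.
    static final Pattern PATTERN =
            Pattern.compile(".*(ss).*", Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);

    // Hypothetical helper: true if the pattern is found anywhere in the input.
    static boolean found(String input) {
        return PATTERN.matcher(input).find();
    }

    public static void main(String[] args) {
        String lowerStream = "fl\u00FC\u00DFchen"; // "flüßchen" with a precomposed ü (assumption)
        String upperStream = "FL\u00DCSSCHEN";     // "FLÜSSCHEN" with a precomposed Ü (assumption)

        System.out.println("upper: " + found(upperStream)); // prints "upper: true"
        System.out.println("lower: " + found(lowerStream)); // prints "lower: false"
    }
}
```

The single character "ß" never matches the two-character sequence "ss" here, which is the behaviour the question is about.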
Upvotes: 0
Views: 102
Reputation: 11020
On my system, it seems to convert lower case correctly to upper:
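import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class IfTesting {
    public static void main( String[] args ) {
        String lowerStream = "flüßchen";
        String upperStream = "FLÜSSCHEN";
        // Print the UTF-8 bytes so each character of the result is visible
        System.out.println( "upper case: " + Arrays.toString( upperStream.getBytes( StandardCharsets.UTF_8 ) ) );
        System.out.println( "lower case to upper: " + Arrays.toString( lowerStream.toUpperCase().getBytes( StandardCharsets.UTF_8 ) ) );
    }
}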
Results in output:
run:
upper case: [70, 76, -61, -100, 83, 83, 67, 72, 69, 78]
lower case to upper: [70, 76, 85, -52, -120, 83, 83, 67, 72, 69, 78]
BUILD SUCCESSFUL (total time: 0 seconds)
You can see that 'S' (83 decimal) appears twice in the second line of output. I don't know if this helps, but it shows that at some level Java understands how to upper-case the characters you provided. On the other hand, I'm guessing that since 83 is clearly in the ASCII range, it will be converted to a lower-case ASCII 's' if you attempt to go the other way, so "SS" would not come back as "ß". That might make it better to convert to upper case before matching. Note that you're using lower-case 'ss' in your match string.
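A rough sketch of that idea follows. It is not a definitive fix, and it rests on one assumption read off the output above: the bytes 85, -52, -120 are UTF-8 for "U" followed by the combining diaeresis U+0308, so the lower-case string appears to hold a decomposed "ü", while -61, -100 in the first line is the precomposed "Ü". The class name UpperCaseMatchSketch and the helper canonicalUpper() are hypothetical; Normalizer, Locale and Pattern are standard JDK classes.

```java
import java.text.Normalizer;
import java.util.Locale;
import java.util.regex.Pattern;

class UpperCaseMatchSketch {
    static final Pattern PATTERN =
            Pattern.compile(".*(ss).*", Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);

    // Hypothetical helper: compose to NFC first, then apply the full
    // upper-case mapping, which turns the sharp s into "SS".
    static String canonicalUpper(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC).toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // Assumption: decomposed form, "u" followed by U+0308, as the byte output suggests.
        String decomposedLower = "flu\u0308\u00DFchen";
        String upperStream = "FL\u00DCSSCHEN"; // precomposed capital U with diaeresis

        // Upper-casing before matching lets the "ss" pattern find "SS".
        System.out.println(PATTERN.matcher(canonicalUpper(decomposedLower)).find()); // prints "true"
        // Without normalization the upper-cased strings still differ.
        System.out.println(decomposedLower.toUpperCase(Locale.ROOT).equals(upperStream)); // prints "false"
        // With NFC normalization first, they compare equal.
        System.out.println(canonicalUpper(decomposedLower).equals(upperStream)); // prints "true"
    }
}
```

If the strings in your source file are in fact precomposed, the normalization step does nothing and only the upper-casing matters.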
Upvotes: 1