Reputation: 3841
We are debugging some old code and came across this statement. Does anyone know what it's doing?
String value=...
value.toLowerCase(Locale.ENGLISH).split("[^\\w]+");
Upvotes: 1
Views: 646
Reputation: 80383
The answer is that it’s doing a lot of things rather naïvely. Why else would they use a negated character class of a word character, [^\w], for what can more readably be had in a simple \W? Doesn’t make any sense.
Plus the locale silliness suggests that they must be afraid they’re in Turkey, since I don’t know any other locale but Turkish and Azeri where there is ever a difference in casing. Normally LATIN CAPITAL LETTER I lowercases to LATIN SMALL LETTER I as you would expect, but in Turkic languages it lowercases to LATIN SMALL LETTER DOTLESS I.
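To make the casing difference concrete, here is a minimal sketch. The class and method names (LocaleLowercaseDemo, lowerEnglish, lowerTurkish) are made up for illustration; only String.toLowerCase(Locale) and Locale.forLanguageTag come from the JDK.

```java
import java.util.Locale;

// Hypothetical demo class, not part of the code being debugged.
class LocaleLowercaseDemo {

    // Lowercase with English rules: 'I' becomes 'i'.
    static String lowerEnglish(String s) {
        return s.toLowerCase(Locale.ENGLISH);
    }

    // Lowercase with Turkish rules: 'I' becomes dotless 'ı' (U+0131).
    static String lowerTurkish(String s) {
        return s.toLowerCase(Locale.forLanguageTag("tr"));
    }

    public static void main(String[] args) {
        System.out.println(lowerEnglish("TITLE")); // title
        System.out.println(lowerTurkish("TITLE")); // t\u0131tle, i.e. with a dotless i
    }
}
```

Passing Locale.ENGLISH explicitly, as the original statement does, keeps the result the same no matter what the JVM's default locale happens to be.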
Even so, it won’t work right for Unicode unless they use the embedded "(?U)" flag, which is only available as of Java 7. You can’t make \w and \W play by Unicode rules just by that silly pointless locale thing. You must use the "(?U)", or else, if you are actually compiling the pattern, the UNICODE_CHARACTER_CLASSES flag. Both of those need Java 7. Before that, Java is worse than merely useless for handling Unicode with regex charclass shortcuts like that. It’s actually misleading, wrong, and harmful.
Otherwise the dumb thing will think that a regular English word like naïvely has two words separated by a nonword sequence. It is super stupid.
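Here is a small sketch of that failure and of the "(?U)" fix, assuming Java 7 or later. The class and method names (WordSplitDemo, splitAscii, splitUnicode) are hypothetical; the first method is the statement from the question, the second swaps in "(?U)\\W+".

```java
import java.util.Arrays;
import java.util.Locale;

// Hypothetical demo class for comparing the two patterns.
class WordSplitDemo {

    // The statement from the question: \w only matches [a-zA-Z_0-9] here.
    static String[] splitAscii(String value) {
        return value.toLowerCase(Locale.ENGLISH).split("[^\\w]+");
    }

    // Same split with the embedded (?U) flag, so \W follows Unicode rules.
    static String[] splitUnicode(String value) {
        return value.toLowerCase(Locale.ENGLISH).split("(?U)\\W+");
    }

    public static void main(String[] args) {
        String text = "Doing it na\u00EFvely"; // \u00EF is the letter i with diaeresis
        System.out.println(Arrays.toString(splitAscii(text)));   // [doing, it, na, vely]
        System.out.println(Arrays.toString(splitUnicode(text))); // [doing, it, na\u00EFvely]
    }
}
```

If the pattern is compiled up front, Pattern.compile("\\W+", Pattern.UNICODE_CHARACTER_CLASSES) should behave the same as the embedded flag.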
So in answer to your question, I don’t think it’s doing what its author thinks it’s doing. I guarantee you that it’s broken unless it’s entirely ASCII text. See here for the hellish things that happened before Java 7 and what you had to do to work around them, and see here for some of what Java 7 brings to the table.
Upvotes: 4
Reputation: 141839
It lowercases the string, then splits it on each group of non-word characters. A word character is a letter, digit, or underscore, so the string is split on runs of anything else.
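A quick sketch on plain ASCII input, where the statement behaves as described. The class and method names (SplitOnNonWordDemo, words, wordsShorthand) and the sample string are made up for illustration.

```java
import java.util.Arrays;
import java.util.Locale;

// Hypothetical demo class; the sample input is ASCII only.
class SplitOnNonWordDemo {

    // The statement from the question, wrapped in a method.
    static String[] words(String value) {
        return value.toLowerCase(Locale.ENGLISH).split("[^\\w]+");
    }

    // The same split written with the \W shorthand instead of [^\w].
    static String[] wordsShorthand(String value) {
        return value.toLowerCase(Locale.ENGLISH).split("\\W+");
    }

    public static void main(String[] args) {
        String sample = "Hello, World! foo_bar 42";
        // Punctuation and spaces are separators; the underscore and digits are kept.
        System.out.println(Arrays.toString(words(sample)));          // [hello, world, foo_bar, 42]
        System.out.println(Arrays.toString(wordsShorthand(sample))); // [hello, world, foo_bar, 42]
    }
}
```

One thing to watch for: if the input starts with a non-word character, split returns an empty string as the first element.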
Upvotes: 0
Reputation: 723498
It appears to be splitting by substrings of non-word characters (represented by [^\w]), into words.
Upvotes: 3