user646584

Reputation: 3841

What does this regex expression mean in Java?

We are debugging some old code and came across this statement. Does anyone know what it's doing?

String value=...
value.toLowerCase(Locale.ENGLISH).split("[^\\w]+");

Upvotes: 1

Views: 646

Answers (3)

tchrist

Reputation: 80383

The answer is that it’s doing a lot of things rather naïvely. Why else would they use a negated character class of a word character [^\w] for what can more readably be had in a simple \W? Doesn’t make any sense.

Plus the locale silliness suggests that they must be afraid they’re in Turkey, since I don’t know any other locale but Turkish and Azeri where there is ever a difference in casing. Normally LATIN CAPITAL LETTER I lowercases to LATIN SMALL LETTER I as you would expect, but in Turkic languages it lowercases to LATIN SMALL LETTER DOTLESS I.
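To make the casing point concrete, here is a minimal sketch of the difference. The class name, the helper method, and the sample input "TITLE" are made up for illustration; only String.toLowerCase(Locale) and Locale.forLanguageTag are standard library calls.

```java
import java.util.Locale;

final class LocaleLowercaseSketch {
    // Hypothetical helper: lowercases the input under the given locale's rules.
    static String lower(String s, Locale locale) {
        return s.toLowerCase(locale);
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr");
        String input = "TITLE"; // sample input, not from the original code

        // English rules: 'I' becomes the ordinary dotted 'i'.
        System.out.println(lower(input, Locale.ENGLISH).equals("title"));   // prints true

        // Turkish rules: 'I' becomes U+0131, LATIN SMALL LETTER DOTLESS I.
        System.out.println(lower(input, turkish).equals("t\u0131tle"));     // prints true
    }
}
```

Passing Locale.ENGLISH, as the code in the question does, pins the first behaviour regardless of the JVM's default locale.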

Even so, it won’t work right for Unicode unless they use the embedded "(?U)" flag, which is available only as of Java 7. You can’t make \w and \W play by Unicode rules just by that silly pointless locale thing. You must use the "(?U)", or else, if you are actually compiling the pattern, the UNICODE_CHARACTER_CLASSES flag. Both of those need Java 7. Before that, Java is worse than merely useless for handling Unicode with regex charclass shortcuts like that. It’s actually misleading, wrong, and harmful.

Otherwise the dumb thing will think that a regular English word like naïvely has two words separated by a nonword sequence. It is super stupid.
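A rough sketch of that failure and of the two Java 7 fixes mentioned above follows. The class and method names are invented for the example, and the ï is written as the escape \u00EF so the result does not depend on source file encoding.

```java
import java.util.regex.Pattern;

final class UnicodeWordSplitSketch {
    // "naïvely" with the ï escaped (U+00EF).
    static final String WORD = "na\u00EFvely";

    // Default rules: \W matches anything outside [a-zA-Z_0-9], including ï.
    static String[] asciiSplit(String s) {
        return s.split("\\W+");
    }

    // Embedded (?U) flag: \w and \W follow Unicode rules (Java 7 and later).
    static String[] unicodeSplit(String s) {
        return s.split("(?U)\\W+");
    }

    // Same behaviour via the compile-time flag (Java 7 and later).
    static String[] unicodeSplitCompiled(String s) {
        return Pattern.compile("\\W+", Pattern.UNICODE_CHARACTER_CLASSES).split(s);
    }

    public static void main(String[] args) {
        System.out.println(asciiSplit(WORD).length);           // prints 2 ("na" and "vely")
        System.out.println(unicodeSplit(WORD).length);         // prints 1
        System.out.println(unicodeSplitCompiled(WORD).length); // prints 1
    }
}
```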

So in answer to your question, I don’t think it’s doing what its author thinks it’s doing. I guarantee you that it’s broken unless it’s entirely ASCII text. See here for the hellish things that happened before Java 7 and what you had to do to work around them, and see here for some of what Java 7 brings to the table.

Upvotes: 4

Paul

Reputation: 141839

It splits the string on each group of non-word characters. A word character is a letter, digit, or underscore. The string is split on runs of anything else.
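As an illustration, here is a small sketch that runs the statement from the question on ASCII input. The class name, the words helper, and the sample strings are assumptions made for the example, since the original value is not shown.

```java
import java.util.Arrays;
import java.util.Locale;

final class SplitOnNonWordSketch {
    // Same expression as in the question, wrapped in a hypothetical helper.
    static String[] words(String value) {
        return value.toLowerCase(Locale.ENGLISH).split("[^\\w]+");
    }

    public static void main(String[] args) {
        // Punctuation and spaces are separators; the underscore is a word character.
        System.out.println(Arrays.toString(words("Hello, World! foo_bar 42")));
        // prints [hello, world, foo_bar, 42]

        // A separator at the very start leaves an empty first element.
        System.out.println(Arrays.toString(words("  leading space")));
        // prints [, leading, space]
    }
}
```

Note the empty first element in the second case: split drops trailing empty strings but keeps a leading one, so callers may want to filter it out.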

Upvotes: 0

BoltClock

Reputation: 723498

It appears to be splitting by substrings of non-word characters (represented by [^\w]), into words.

Upvotes: 3
