Gil Caplan

Reputation: 64

Regex: remove punctuation that is not part of a word inside a string

I have this code:

String s="  //wont won't won't ";
String[] w =  s.split("[\\s+\\/,\\.!_\\-?;:]++");

I don't want the ' to be removed from won't, since it is part of the word. In //wont, however, I do want the // to be removed. Any help would be appreciated.

So my question is: how do I use a regex in Java so that a punctuation character is kept when it is part of a word, like the ' in "won't", while still splitting on this pattern:

"[\\s+\\/,\\.!_\\-?;:]++"

Upvotes: 2

Views: 71

Answers (1)

Wiktor Stribiżew

Reputation: 627077

You can use

String[] w = s.split("[\\s+/,.!_\\-?;:]+|\\B'|'\\B");

See the regex demo. Details:

  • [\s+/,.!_\-?;:]+ - one or more whitespace characters, +, /, ,, ., !, _, -, ?, ; or : (inside a character class, + is a literal plus sign, not a quantifier)
  • | - or
  • \B' - a ' that is at the start of the string or immediately preceded by a non-word char
  • | - or
  • '\B - a ' that is at the end of the string or immediately followed by a non-word char.
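
To see the two apostrophe alternatives in isolation, here is a minimal sketch that applies only `\B'|'\B` with `replaceAll`. The class name `ApostropheBoundaryDemo` and the helper `stripEdgeApostrophes` are hypothetical names chosen for illustration; they are not part of the original answer.

```java
class ApostropheBoundaryDemo {
    // Only the two apostrophe alternatives from the pattern above.
    static final String EDGE_APOSTROPHE = "\\B'|'\\B";

    // Hypothetical helper: removes every apostrophe that does NOT have
    // a word character on both sides, and leaves the others untouched.
    static String stripEdgeApostrophes(String text) {
        return text.replaceAll(EDGE_APOSTROPHE, "");
    }

    public static void main(String[] args) {
        System.out.println(stripEdgeApostrophes("'tis won't dogs'"));
        // prints: tis won't dogs
    }
}
```

The apostrophe in won't sits between n and t, so there is a word boundary on each side of it and neither `\B` alternative can match there.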

See the Java demo:

import java.util.Arrays;

String s = "  //wont won't won't ";
String[] w = s.split("[\\s+/,.!_\\-?;:]+|\\B'|'\\B");
System.out.println(Arrays.toString(w));
// => [, wont, won't, won't]

You can get rid of the empty entry at the start by first removing all matches at the start of the string:

String regex = "[\\s+/,.!_\\-?;:]+|\\B'|'\\B";
String[] w2 = s.replaceFirst("^(?:" + regex + ")+", "").split(regex);
System.out.println(Arrays.toString(w2));
// => [wont, won't, won't]
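
As an alternative sketch (not from the original answer), the empty entries can also be filtered out after splitting. This additionally covers input such as "say 'hi'", where a space and a following apostrophe match as two separate, adjacent delimiters and `split` therefore yields an empty string in the middle of the array. The class name `WordSplitter` and the method `tokenize` are hypothetical.

```java
import java.util.Arrays;

class WordSplitter {
    // Same delimiter pattern as in the answer.
    static final String DELIMITER = "[\\s+/,.!_\\-?;:]+|\\B'|'\\B";

    // Hypothetical helper: splits on the delimiter and drops empty tokens,
    // wherever in the array they occur.
    static String[] tokenize(String text) {
        return Arrays.stream(text.split(DELIMITER))
                     .filter(token -> !token.isEmpty())
                     .toArray(String[]::new);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(tokenize("  //wont won't won't ")));
        // prints: [wont, won't, won't]
        System.out.println(Arrays.toString(tokenize("say 'hi'")));
        // prints: [say, hi]
    }
}
```

Filtering costs one extra pass over the array, but it avoids building a second regex from the first one.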

Upvotes: 1
