Reputation: 3021
I am trying to learn regular expressions, and I want to replace punctuation in a string with whitespace so that I can feed the result into a tokenizer. The string might contain many punctuation characters. However, I do not want to replace an apostrophe or hyphen that occurs inside a word.
For example,
six-pack => six-pack
He's => He's
This,that => This That
I first tried replacing every punctuation character with whitespace, but that did not work because it also removes the hyphens and apostrophes inside words. I then tried to replace only the punctuation at word boundaries, as in
\B[^\p{L}\p{N}\s]+\B|\b[^\p{L}\p{N}\s]+\B|\B[^\p{L}\p{N}\s]+\b
But I am not able to exclude the hyphen and apostrophe from the match.
I suspect the regex above is also needlessly cumbersome and that there is a better way. Is there one?
So, all I am trying to do is:
Any help is appreciated.
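For reference, the first attempt described above (replacing every punctuation character) can be sketched as follows. This is only an illustration: the class name NaiveReplaceDemo and the method replaceAllPunct are hypothetical, and it assumes Java's POSIX class \p{Punct}, which includes both ' and -.

```java
public class NaiveReplaceDemo {
    // Hypothetical helper: replaces EVERY ASCII punctuation character,
    // including ' and -, with a single space.
    static String replaceAllPunct(String s) {
        return s.replaceAll("\\p{Punct}", " ");
    }

    public static void main(String[] args) {
        System.out.println(replaceAllPunct("six-pack"));  // six pack  (hyphen lost)
        System.out.println(replaceAllPunct("He's"));      // He s      (apostrophe lost)
        System.out.println(replaceAllPunct("This,that")); // This that (desired)
    }
}
```

Only the last result is what I want; the first two show why the in-word hyphen and apostrophe need to be excluded.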
Upvotes: 2
Views: 52
Reputation: 174696
You could use a negative lookahead assertion, as shown below:
String s = "six-pack\n"
+ "He's\n"
+ "This,that";
System.out.println(s.replaceAll("(?m)^['-]|['-]$|(?!['-])\\p{Punct}", " "));
Output:
six-pack
He's
This that
Explanation:
(?m) : Multiline mode, so ^ and $ match at the start and end of each line.
^['-] : Matches ' or - at the start of a line.
| : OR
['-]$ : Matches ' or - at the end of a line.
| : OR
(?!['-])\\p{Punct} : Matches every punctuation character except ' and -. This alternative never touches those two symbols; they are matched only by the first two alternatives (i.e., at the start and end of a line).
Upvotes: 0
Reputation: 785058
You can use this lookahead-based regex:
(?!((?!^)['-].))\\p{Punct}
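A minimal sketch of how this regex might be applied with String.replaceAll. The class name LookaheadDemo and the method clean are hypothetical names chosen for illustration. The lookahead skips a ' or - only when it is not at the start of the input and is followed by another character, so leading and trailing ones are still replaced.

```java
public class LookaheadDemo {
    // Regex from the answer above, written as a Java string literal.
    static final String REGEX = "(?!((?!^)['-].))\\p{Punct}";

    // Hypothetical helper: replaces each matched punctuation character with a space.
    static String clean(String s) {
        return s.replaceAll(REGEX, " ");
    }

    public static void main(String[] args) {
        System.out.println(clean("six-pack"));  // six-pack
        System.out.println(clean("He's"));      // He's
        System.out.println(clean("This,that")); // This that
        System.out.println(clean("-six"));      // " six" (leading hyphen replaced)
    }
}
```

Note that this checks only for a following character, not for a word character on each side, so a hyphen preceded by a space (as in "a -b") would also be kept.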
Upvotes: 0
Reputation: 48404
You can probably work out one set of punctuation characters that are acceptable between words and another set that are not, then define your regular expression based on that.
For instance:
String[] input = {
    "six-pack",  // => six-pack
    "He's",      // => He's
    "This,that"  // => This that
};
for (String s : input) {
    // Replace punctuation other than ' and - when it sits between two word characters
    System.out.println(s.replaceAll("(?<=\\w)[\\p{Punct}&&[^'-]](?=\\w)", " "));
}
Output
six-pack
He's
This that
Note
Here I'm defining the Pattern with a character class that includes all POSIX punctuation but, through intersection with a negated class, excludes ' and -. The match must also be preceded and followed by a word character.
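As a rough sketch of what the lookbehind and lookahead imply in practice, the snippet below wraps the same pattern in a hypothetical helper (BetweenWordsDemo and clean are illustrative names, not part of the answer). Because a word character is required on both sides, punctuation that is followed by a space or that ends the string is left as it is.

```java
public class BetweenWordsDemo {
    // Same pattern as above: punctuation other than ' and -,
    // with a word character immediately before and after it.
    static final String BETWEEN_WORDS = "(?<=\\w)[\\p{Punct}&&[^'-]](?=\\w)";

    // Hypothetical helper: replaces each match with a single space.
    static String clean(String s) {
        return s.replaceAll(BETWEEN_WORDS, " ");
    }

    public static void main(String[] args) {
        System.out.println(clean("This,that"));     // This that
        System.out.println(clean("six-pack"));      // six-pack
        System.out.println(clean("Hello, world!")); // Hello, world! (unchanged)
    }
}
```

If punctuation next to whitespace should also be replaced, the lookarounds would need to be relaxed; whether that is desirable depends on what the tokenizer expects.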
Upvotes: 1