Flash

Reputation: 3021

Correct existing regular expression / create a new one

I am learning regular expressions and want to replace punctuation in a string with whitespace before feeding it to a tokenizer. The string may contain many punctuation characters. However, I do not want to replace an apostrophe or hyphen that occurs inside a word.

For example,

six-pack => six-pack
He's => He's
This,that => This that

I first tried replacing every punctuation character with whitespace, but that does not work because it also splits words such as six-pack and He's. I then tried to target only some of the punctuation by using word boundaries, as in

\B[^\p{L}\p{N}\s]+\B|\b[^\p{L}\p{N}\s]+\B|\B[^\p{L}\p{N}\s]+\b

However, I cannot work out how to exclude the hyphen and apostrophe from the match.

I also suspect the regex above is more cumbersome than it needs to be. Is there a better way?

So, all I am trying to do is:

  1. Replace all punctuation with whitespace.
  2. Leave hyphens and apostrophes alone.
  3. Still replace a hyphen or apostrophe if it occurs at the start or end of a word.

Any help is appreciated.

Upvotes: 2

Views: 52

Answers (3)

Avinash Raj

Reputation: 174696

You could use a negative lookahead assertion, as below:

String s = "six-pack\n"
        + "He's\n"
        + "This,that";
System.out.println(s.replaceAll("(?m)^['-]|['-]$|(?!['-])\\p{Punct}", " "));

Output:

six-pack
He's
This that

Explanation:

  • (?m) Enables multiline mode, so ^ and $ match at the start and end of each line.
  • ^['-] Matches a ' or - at the start of a line.
  • | OR
  • ['-]$ Matches a ' or - at the end of a line.
  • | OR
  • (?!['-])\\p{Punct} Matches any punctuation character except ' and -. Those two are therefore replaced only by the first two alternatives, that is, at the start or end of a line.
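To see how the three alternatives behave on inputs beyond the original examples, here is a minimal sketch that wraps the same pattern in a helper. The class name LineEdgePunctuation, the method clean, and the extra test strings are illustrative assumptions, not part of the answer.

```java
import java.util.regex.Pattern;

// Illustrative wrapper; only the pattern itself comes from the answer above.
final class LineEdgePunctuation {
    private static final Pattern PUNCT =
            Pattern.compile("(?m)^['-]|['-]$|(?!['-])\\p{Punct}");

    static String clean(String text) {
        return PUNCT.matcher(text).replaceAll(" ");
    }

    public static void main(String[] args) {
        System.out.println(clean("six-pack"));   // inner hyphen is kept
        System.out.println(clean("This,that"));  // comma becomes a space
        System.out.println(clean("-start"));     // hyphen at line start becomes a space
        System.out.println(clean("end'"));       // apostrophe at line end becomes a space
        System.out.println(clean("a -b"));       // hyphen at a word start mid-line is kept
    }
}
```

The last call shows a limit worth keeping in mind: ^ and $ anchor to lines, not words, so a hyphen or apostrophe that starts or ends a word in the middle of a line is left in place.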


Upvotes: 0

anubhava

Reputation: 785058

You can use this lookahead-based regex (written as it would appear inside a Java string literal):

(?!((?!^)['-].))\\p{Punct}
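A rough sketch of how this regex could be applied with String.replaceAll follows. The class name LookaheadPunctuation, the method clean, and the extra inputs are assumptions made for illustration only.

```java
// Illustrative wrapper; only the regex comes from the answer above.
final class LookaheadPunctuation {
    static String clean(String text) {
        // The outer negative lookahead rejects a ' or - that is not at the
        // start of the input and is followed by another character.
        return text.replaceAll("(?!((?!^)['-].))\\p{Punct}", " ");
    }

    public static void main(String[] args) {
        System.out.println(clean("six-pack"));   // six-pack
        System.out.println(clean("He's"));       // He's
        System.out.println(clean("This,that"));  // This that
        System.out.println(clean("-start"));     // leading hyphen becomes a space
        System.out.println(clean("end'"));       // trailing apostrophe becomes a space
    }
}
```

Without the (?m) flag, ^ here refers to the start of the whole input, so the "start" case applies only to the first character of the string.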


Upvotes: 0

Mena

Reputation: 48404

You can probably work out one set of punctuation characters that are acceptable between words and another set that are not, then define your regular expression based on that.

For instance:

String[] input = {
    "six-pack",   // => six-pack
    "He's",       // => He's
    "This,that"   // => This that
};
for (String s: input) {
    System.out.println(s.replaceAll("(?<=\\w)[\\p{Punct}&&[^'-]](?=\\w)", " "));
}

Output

six-pack
He's
This that

Note

Here I define the pattern with a character class that contains all POSIX punctuation except ' and - (by intersecting it with a negated class), and I require a word character immediately before and after the match.
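Because of the two lookarounds, this pattern only fires when the punctuation sits directly between two word characters. The sketch below illustrates that boundary; the class name BetweenWordsPunctuation, the method clean, and the extra inputs are hypothetical and not part of the answer.

```java
// Illustrative wrapper; only the pattern comes from the answer above.
final class BetweenWordsPunctuation {
    static String clean(String text) {
        return text.replaceAll("(?<=\\w)[\\p{Punct}&&[^'-]](?=\\w)", " ");
    }

    public static void main(String[] args) {
        System.out.println(clean("This,that"));      // This that
        System.out.println(clean("Hello, world!"));  // Hello, world!
        System.out.println(clean("-six-pack-"));     // -six-pack-
    }
}
```

The comma in "Hello, world!" is followed by a space and the "!" by nothing, so neither is replaced, and hyphens are never replaced because they are excluded from the class. If punctuation next to whitespace or at the edges of the string should also go, the lookarounds would need to be relaxed.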

Upvotes: 1
