Mr. Polywhirl

Reputation: 48640

Java ReplaceAll Regular Expression With Exclusions

I am trying to replace all instances of sentence terminators such as '.', '?', and '!', but I do not want to replace strings like "dr." and "mr.".

I have tried the following:

text = text.replaceAll("(?![mr|mrs|ms|dr])(\\s*[\\.\\?\\!]\\s*)", "\n");

...but that does not seem to work. Any suggestions would be appreciated.


Edit: After the feedback here and a bit of tweaking, this is the working solution to my problem.

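private String convertText(String text) {
  // Collapse every run of whitespace (including line breaks) into a single space.
  text = text.replaceAll("\\s+", " ");
  // Strip parentheses, double quotes, commas, and colons.
  text = text.replaceAll("[\n\r\\(\\)\"\\,\\:]", "");
  // Turn each terminator (. ? ! ;) that is followed by whitespace or end of input into a line break,
  // unless it directly follows a title (dr, mr, ...) or a single word character after whitespace.
  text = text.replaceAll("(?i)(?<!dr|mr|mrs|ms|jr|sr|\\s\\w)(\\s*[\\.\\?\\!\\;](?:\\s+|$))","\r\n");
  return text.trim();
}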

The code extracts all* compound and single sentences from an excerpt of text, removing punctuation and extraneous whitespace.
*There are some exceptions...
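
For anyone who wants to try it, here is a minimal sketch that wraps the method above in a hypothetical class, ConvertTextDemo (the class name, the static modifier, and the sample sentences are my own assumptions, not part of the original code). The second sample shows one case this pattern leaves unsplit: a sentence ending in a single-letter word, which the \\s\\w alternative treats like an initial. That may or may not be among the exceptions mentioned above.

```java
public class ConvertTextDemo {
    // Same three replacements as the method above; made static here only for easy calling.
    static String convertText(String text) {
        text = text.replaceAll("\\s+", " ");
        text = text.replaceAll("[\n\r\\(\\)\"\\,\\:]", "");
        text = text.replaceAll(
                "(?i)(?<!dr|mr|mrs|ms|jr|sr|\\s\\w)(\\s*[\\.\\?\\!\\;](?:\\s+|$))", "\r\n");
        return text.trim();
    }

    public static void main(String[] args) {
        // Three sentences, one per line; the period in "Dr." is kept.
        System.out.println(convertText("Dr. Smith arrived.  Was he late? No!"));
        System.out.println("---");
        // "I." is preceded by whitespace plus one word character, so no split happens there.
        System.out.println(convertText("It was I. He left."));
    }
}
```

The sentences come back separated by "\r\n", so splitting the result on that sequence gives one sentence per element.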

Upvotes: 4

Views: 1812

Answers (2)

Narendra Yadala

Reputation: 9664

You need to use a negative lookbehind instead of a negative lookahead, like this:

String x = "dr. house.";
System.out.println(x.replaceAll("(?<!mr|mrs|ms|dr)(\\s*[\\.\\?\\!]\\s*)","\n"));

Also, the list of mr/dr/ms/mrs should not be inside a character class: [mr|mrs|ms|dr] matches any single one of the characters m, r, s, d, or |, not one of the words.
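
To make both points concrete, here is a small sketch comparing the pattern from the question with the lookbehind version. The class and method names (LookbehindDemo, withLookahead, withLookbehind) are hypothetical and only there for illustration.

```java
public class LookbehindDemo {
    // Pattern from the question: a negative lookahead over a character class.
    // It only checks that the NEXT character is not one of m, r, s, d, or |.
    static String withLookahead(String text) {
        return text.replaceAll("(?![mr|mrs|ms|dr])(\\s*[\\.\\?\\!]\\s*)", "\n");
    }

    // Negative lookbehind: the match may not start directly after mr, mrs, ms, or dr.
    static String withLookbehind(String text) {
        return text.replaceAll("(?<!mr|mrs|ms|dr)(\\s*[\\.\\?\\!]\\s*)", "\n");
    }

    public static void main(String[] args) {
        String x = "dr. house.";
        // The lookahead version still splits after "dr".
        System.out.println(withLookahead(x).replace("\n", "\\n"));   // dr\nhouse\n
        // The lookbehind version keeps "dr." and replaces only the final period.
        System.out.println(withLookbehind(x).replace("\n", "\\n"));  // dr. house\n
        // The bracketed list is a set of single characters, not a list of words.
        System.out.println("|".matches("[mr|mrs|ms|dr]"));   // true
        System.out.println("dr".matches("[mr|mrs|ms|dr]"));  // false
    }
}
```

Note that the lookbehind here is case-sensitive, so "Dr." would still be split unless you add the (?i) flag.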

Upvotes: 2

The111

Reputation: 5867

You're going to need a complete list of the letter combinations that are allowed to precede ".". Then you can replace "dr." and "mr." (and any other allowed combos) with something unique like dr28dsj458sj and mr28dsj458sj. Ideally, you should check that your temporary substitute value exists nowhere else in the document. Then go through and remove all your sentence terminators, and finally go through again and replace the occurrences of 28dsj458sj with "." again.
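
A rough sketch of those three passes, assuming the allowed list is just dr, mr, mrs, and ms and that terminators are replaced with "\n" as in the question. The class name PlaceholderDemo and the helper splitSentences are made up for this example.

```java
public class PlaceholderDemo {
    // Temporary stand-in for the protected period; assumed not to occur in the input.
    private static final String MARKER = "28dsj458sj";

    static String splitSentences(String text) {
        // Refuse to run if the marker already appears, since restoring would corrupt the text.
        if (text.contains(MARKER)) {
            throw new IllegalArgumentException("input already contains the marker");
        }
        // Pass 1: swap the period after each allowed abbreviation for the marker.
        String shielded = text.replaceAll("(?i)\\b(dr|mr|mrs|ms)\\.", "$1" + MARKER);
        // Pass 2: replace the remaining sentence terminators.
        String split = shielded.replaceAll("\\s*[.?!]\\s*", "\n");
        // Pass 3: put the protected periods back.
        return split.replace(MARKER, ".");
    }

    public static void main(String[] args) {
        System.out.println(splitSentences("Dr. House is in. Is Mr. Smith here? Yes!"));
    }
}
```

Compared with a lookbehind, this costs two extra passes over the text, but the list of abbreviations lives in one plain alternation that is easy to extend.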

Upvotes: -1
