Reputation: 48640
I am trying to replace all instances of sentence terminators such as '.', '?', and '!', but I do not want to replace strings like "dr." and "mr.".
I have tried the following:
text = text.replaceAll("(?![mr|mrs|ms|dr])(\\s*[\\.\\?\\!]\\s*)", "\n");
...but that does not seem to work. Any suggestions would be appreciated.
private String convertText(String text) {
    text = text.replaceAll("\\s+", " ");
    text = text.replaceAll("[\n\r\\(\\)\"\\,\\:]", "");
    text = text.replaceAll("(?i)(?<!dr|mr|mrs|ms|jr|sr|\\s\\w)(\\s*[\\.\\?\\!\\;](?:\\s+|$))", "\r\n");
    return text.trim();
}
The code will extract all* compound and single sentences from an excerpt of text, removing all punctuation and extraneous white-space.
*There are some exceptions...
Upvotes: 4
Views: 1812
Reputation: 9664
You need to use a negative lookbehind instead of a negative lookahead, like this:
String x = "dr. house.";
System.out.println(x.replaceAll("(?<!mr|mrs|ms|dr)(\\s*[\\.\\?\\!]\\s*)","\n"));
Also, the list of mr/dr/ms/mrs should not be inside a character class: [mr|mrs|ms|dr] matches a single character from that set, not one of the words.
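To make the difference concrete, here is a minimal sketch that runs the question's lookahead pattern and the lookbehind pattern side by side on the same input. The class name LookbehindDemo and the helper split are hypothetical names chosen for illustration; the two regular expressions are the ones quoted in this thread.

```java
// Hypothetical demo class; only the two regex strings come from the thread.
class LookbehindDemo {
    // Question's attempt: the lookahead inspects the NEXT character, and
    // [mr|mrs|ms|dr] is a character class (any one of m, r, |, s, d).
    static final String LOOKAHEAD = "(?![mr|mrs|ms|dr])(\\s*[\\.\\?\\!]\\s*)";

    // Suggested fix: the lookbehind inspects the text BEFORE the match,
    // and the titles are plain alternatives rather than a character class.
    static final String LOOKBEHIND = "(?<!mr|mrs|ms|dr)(\\s*[\\.\\?\\!]\\s*)";

    // Replaces every match of the given pattern with a newline.
    static String split(String text, String regex) {
        return text.replaceAll(regex, "\n");
    }

    public static void main(String[] args) {
        String x = "dr. house.";
        System.out.println(split(x, LOOKAHEAD));
        System.out.println(split(x, LOOKBEHIND));
    }
}
```

With the lookahead, the period after "dr" is still replaced, giving "dr", a newline, "house", and a newline. With the lookbehind, only the final period is replaced, so "dr. house" stays on one line. Note that the lookbehind as written is case-sensitive, so "Dr." is not protected unless (?i) is added, as in the question's later version.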
Upvotes: 2
Reputation: 5867
You're going to need a complete list of the letter combinations that are allowed to precede ".". Then you can replace "dr." and "mr." (and any other allowed combos) with something unique like "dr28dsj458sj" and "mr28dsj458sj". Ideally you should check that your temp substitute value exists nowhere else in the document. Then go through and remove all your sentence terminators, then go through again and replace the occurrences of "28dsj458sj" with "." again.
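The three passes described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the class name PlaceholderSplitter, the abbreviation list, the case-insensitive matching, and the word-boundary check are all choices made for this example, not something the answer specifies. The placeholder value is the one from the answer.

```java
// Hypothetical sketch of the placeholder approach; names and the
// abbreviation list are assumptions made for illustration.
class PlaceholderSplitter {
    // Temporary stand-in for a protected period (value taken from the answer).
    static final String TOKEN = "28dsj458sj";

    // A period that directly follows one of the allowed abbreviations.
    // \b keeps words that merely end in "ms" or "dr" from being protected.
    static final String PROTECTED_PERIOD = "(?i)(?<=\\b(?:dr|mr|mrs|ms))\\.";

    static String split(String text) {
        // Refuse to run if the placeholder already occurs in the input.
        if (text.contains(TOKEN)) {
            throw new IllegalArgumentException("placeholder already occurs in the text");
        }
        // Pass 1: hide the periods that belong to abbreviations.
        String result = text.replaceAll(PROTECTED_PERIOD, TOKEN);
        // Pass 2: replace the remaining sentence terminators with newlines.
        result = result.replaceAll("\\s*[.?!]\\s*", "\n");
        // Pass 3: restore the hidden periods (literal replace, no regex).
        return result.replace(TOKEN, ".");
    }

    public static void main(String[] args) {
        System.out.println(split("Dr. House is in. Is Mr. Smith here? Yes!"));
    }
}
```

Compared with a single lookbehind regex, this costs two extra passes over the text, but the list of abbreviations lives in one pattern and the terminator pattern stays simple.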
Upvotes: -1