Rob Hufschmitt

Reputation: 409

Regular Expression to find the end of sentences

I am writing a regular expression to find the ends of sentences in a text. I assume that any sentence can end with '.', '!' or '?'. Sometimes, though, people like to write !!!!!! at the end of their sentence, so I want to collapse any run of repeated dots, exclamation marks or question marks into a single character. The exception is '...', which I want to leave as it is. How can I include this exception? Please advise, thanks!

Pattern p = null;
try {
    // ([!?.] with optional spaces), followed by ([!?.] with optional spaces) repeated 1 or more times
    p = Pattern.compile("([!?.]\\s*)([!?.]\\s*)+");
} catch (PatternSyntaxException pex) {
    pex.printStackTrace();
    System.exit(0);
}

// get the matcher
Matcher m = p.matcher(this.sentence);
int index = 0;
while (m.find(index)) {
    System.out.println(this.sentence);
    System.out.println(p.toString());
    String toReplace = sentence.substring(m.start(), m.end());
    toReplace = toReplace.replaceAll("\\.", "\\\\.");
    toReplace = toReplace.replaceAll("\\?", "\\\\?");
    String replacement = "" + sentence.charAt(m.start());
    this.sentence = this.sentence.replaceAll(toReplace, replacement);
    System.out.println("");
    index = m.end();
    System.out.println(this.sentence);
}

Upvotes: 5

Views: 4923

Answers (4)

harry

Reputation: 338

I am working on something similar. So far it looks like I can split my paragraphs (grouped by the blank lines between them) into sentences by looking for one of the characters [.?!] followed by either a) one or two spaces and then a capitalized word (not a single letter), or b) nothing, because it is the end of the paragraph. In my case I don't have any embedded quoted text, but that is a case I would want to exclude if I do find some. I am processing legal and financial documents, so I am not sure NLP would be helpful; the language is hardly natural. But I may take a look: creating a suitable regex is looking tough, so an NLP approach might save time.
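
The heuristic described above can be sketched as a single split pattern. This is only a rough sketch under a few assumptions: the class and method names are hypothetical, capitals are ASCII, and "a word, not a single letter" is approximated as a capital letter followed by a lowercase letter. The end-of-paragraph case needs no rule of its own, because the last sentence is simply whatever remains after the final split.

```java
import java.util.regex.Pattern;

class CapitalizedBoundarySplit {
    // Split on one or two spaces that come right after '.', '?' or '!'
    // and right before a capital letter followed by a lowercase letter.
    // Assumption: ASCII letters only; quoted text is not handled.
    static final Pattern BOUNDARY =
            Pattern.compile("(?<=[.?!]) {1,2}(?=[A-Z][a-z])");

    static String[] splitParagraph(String paragraph) {
        return BOUNDARY.split(paragraph.trim());
    }

    public static void main(String[] args) {
        String paragraph =
                "The fee is due. See Section A. B remains unchanged. Payment follows.";
        for (String sentence : splitParagraph(paragraph)) {
            System.out.println(sentence);
        }
    }
}
```

Note that "A. B remains" is not split, because "B" is a single letter. Abbreviations followed by a capitalized word (for example "Inc. The") would still be split, so a real document set probably needs an exception list.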

Upvotes: 0

M. Jessup

Reputation: 8222

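The simplest regex solution for the "..." case is to use quantified matches: one or two dots, or four or more dots, but never exactly three. The one-or-two-dot alternative needs lookarounds, otherwise it would match the first two dots of "..." and then the third:

someString.split("(?<!\\.)\\.{1,2}(?!\\.)|\\.{4,}|\\?+|!+");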

This is of course disregarding the other edge cases as already mentioned.
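
A self-contained sketch of this kind of split, for anyone who wants to try it out. The class and method names are hypothetical. The lookarounds keep the one-or-two-dot branch from matching inside a longer run of dots, and empty pieces (which appear between mixed terminators such as "?!") are filtered out afterwards.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class EllipsisAwareSplit {
    // Matches a run of '.', '?' or '!' unless the run is exactly three dots.
    // (?<!\.) and (?!\.) ensure that 1-2 dots are not part of a longer dot run.
    static final Pattern TERMINATOR =
            Pattern.compile("(?<!\\.)\\.{1,2}(?!\\.)|\\.{4,}|\\?+|!+");

    static String[] splitSentences(String text) {
        return Arrays.stream(TERMINATOR.split(text))
                .map(String::trim)
                .filter(s -> !s.isEmpty()) // drop empty pieces, e.g. between '?' and '!'
                .toArray(String[]::new);
    }

    public static void main(String[] args) {
        String text = "Wait... what?! No way!!!! Fine.";
        // prints [Wait... what, No way, Fine]
        System.out.println(Arrays.toString(splitSentences(text)));
    }
}
```

Because split discards the delimiters, the terminating punctuation is lost; if it matters whether a sentence ended in '?' or '!', a Matcher loop over the same pattern would be a better fit.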

Upvotes: 0

Martin Jespersen

Reputation: 26183

The simplest solution to this is usually to first replace all occurrences of the string "..." with some special character that doesn't otherwise occur in the text, for example an ASCII control character.

After this replace, replace all the multiple instances of your end-of-sentence characters with singles.

Then find the ends of sentences using your end-of-sentence characters plus the special character you used to replace "..." (if you want "..." to denote the end of a sentence).

Lastly replace the special char with "..." again.

I am not a Java programmer, so I can't give you specific Java code to do it, but the easiest way to handle this type of problem is usually multiple split/join statements and not replaces.

So something like:

str.split("...").join("<special char>")
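
In Java, `String.split` takes a regular expression, so `split("...")` would match any three characters; `String.replace` does the literal substitution instead. The steps above might look roughly like the sketch below. The class and method names are hypothetical, and it assumes the control character U+0001 never appears in the input.

```java
class EllipsisPlaceholder {
    // Assumption: this control character (U+0001) does not occur in the text.
    static final String PLACEHOLDER = "\u0001";

    static String collapsePunctuation(String text) {
        // 1. Protect every literal "..." (String.replace is not regex-based).
        String s = text.replace("...", PLACEHOLDER);
        // 2. Collapse each run of [.!?], optionally separated by whitespace,
        //    down to the first character of the run.
        s = s.replaceAll("([.!?])(?:\\s*[.!?])+", "$1");
        // 3. Put the protected ellipses back.
        return s.replace(PLACEHOLDER, "...");
    }

    public static void main(String[] args) {
        // prints: What! Wait... really?
        System.out.println(collapsePunctuation("What!!!!! Wait... really??"));
    }
}
```

One limitation of this sketch: a run of four dots is seen as "..." plus a single dot, so it comes back unchanged as "....". If longer dot runs should collapse as well, they would have to be handled before the placeholder step.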

Upvotes: 0

darioo

Reputation: 47183

Disclaimer: my answer will be off topic (not using regular expressions).

If it's not too heavyweight, try using Apache OpenNLP. NLP means "natural language processing". Check the documentation on detecting sentences.

The relevant bit of code is:

String sentences[] = sentenceDetector.sentDetect("  First sentence. Second sentence. ");

You'll get an array of two Strings. The first one will be "First sentence." and the second one will be "Second sentence.".

There's more code to be written before you can use that line (a sentence model has to be loaded and the detector constructed from it), but you get the general idea.

Upvotes: 2
