Limon

Reputation: 993

Replacing quotations to sentences

I'm trying to reduce some of the complexity of online text by removing everything except Latin characters and [!?., ]. Most characters can be removed without a problem, but for some of them I want specific rules:

A pair of ( and ), a pair of " (quotation marks), or a pair of * should turn the text inside them into a sentence if it contains more than one word. By converting to a sentence, I simply mean adding a full stop at the end. For example:

but *after* I came up with it, I searched and...

to

but after I came up with it, I searched and...

Here I simply want the * removed, as opposed to:

 *buys airplane ticket* IM COMING FOR YOU

to

 buys airplane ticket. IM COMING FOR YOU

In the first example the author simply puts emphasis on a word that is part of the sentence; in the second, the author describes an action that might as well be a sentence on its own. Quotation marks work similarly: a single quoted word is usually some sort of emphasis or sarcasm, while several words are a quotation.

Is there a way to do this in regex (Java)?

EDIT: My general approach requires 2 patterns for each of the three cases: parentheses, quotation marks and *. The first step handles the multi-word case by replacing \*((\w+ )+\w+)\* with $1. (the captured text plus a full stop), and the second step replaces every remaining * with nothing. This works, but it needs 6 regex calls in total. Is there a better way?

Upvotes: 2

Views: 77

Answers (2)

Limon

Reputation: 993

So my current best approach requires 2*numCases Patterns and looks like this:

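import java.util.regex.Pattern;

// Matches two or more space-separated words wrapped in a pair of '*'.
static Pattern pattern = Pattern.compile("\\*((\\w+ )+\\w+)\\*");
// Matches any '*' left over after the multi-word pass.
static Pattern remove = Pattern.compile("\\*");

public static String transform(String str) {
    // Multi-word spans become sentences: drop the '*' pair and append a full stop.
    String sentences = pattern.matcher(str).replaceAll("$1.");
    // Single-word emphasis: just delete the remaining '*'.
    return remove.matcher(sentences).replaceAll("");
}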

Running

System.out.println(transform("but *after* I came up with it, I searched and..."));
System.out.println(transform("*buys airplane ticket* IM COMING FOR YOU"));

Gives the expected

but after I came up with it, I searched and...
buys airplane ticket. IM COMING FOR YOU
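
One possible way to get from 2*numCases patterns down to a single pass is to put all three delimiter pairs into one alternation and decide per match whether to append the full stop. The following is only a sketch under a few assumptions that are not in the original: the class name DelimiterCleaner is made up, "multi-word" is taken to mean "more than one whitespace-separated token", and the text between delimiters may contain any character except the closing delimiter (looser than the \w+ rule above).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the name is illustrative only.
class DelimiterCleaner {
    // One alternative per delimiter pair: *...*, "...", (...).
    // Exactly one of the three capture groups is non-null for each match.
    private static final Pattern DELIMITED =
            Pattern.compile("\\*([^*]*)\\*|\"([^\"]*)\"|\\(([^)]*)\\)");

    public static String transform(String str) {
        Matcher m = DELIMITED.matcher(str);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Pick whichever group actually captured the inner text.
            String inner = m.group(1) != null ? m.group(1)
                         : m.group(2) != null ? m.group(2)
                         : m.group(3);
            String trimmed = inner.trim();
            // Assumption: more than one whitespace-separated token means "sentence".
            boolean multiWord = trimmed.split("\\s+").length > 1;
            String replacement = multiWord ? trimmed + "." : trimmed;
            // quoteReplacement keeps '$' and '\' in the text from being interpreted.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(transform("but *after* I came up with it, I searched and..."));
        System.out.println(transform("*buys airplane ticket* IM COMING FOR YOU"));
    }
}
```

Unlike the two-pattern version, this sketch leaves unpaired delimiters (a lone * or ") in place, so a final cleanup replace may still be needed for those.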

Upvotes: 0

tucuxi

Reputation: 17955

The standard Java library has no built-in notion of what a full English phrase looks like (telling white-space apart from letters or punctuation is about as far as it will help you). Additionally,

  • No regular expression can parse English correctly. Regular expressions don't do nesting well.
  • You may have luck using a grammar-checker such as those built into common word-processing software. However, they still have significant error rates.
  • While there may exist NLP Java libraries that implement robust parsing, they will still not understand context, and fail frequently.
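
To illustrate the nesting point, here is a small sketch (the class name NestingDemo and the sample sentence are made up for this example) of what a simple parenthesis pattern does with nested input: it stops at the first closing parenthesis rather than the matching one.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of any library.
class NestingDemo {
    // Naive "text between parentheses" pattern: anything up to the first ')'.
    private static final Pattern PARENS = Pattern.compile("\\(([^)]*)\\)");

    // Returns the text captured by the first match, or null if there is none.
    static String firstGroup(String s) {
        Matcher m = PARENS.matcher(s);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // The outer pair is not recognised; the capture ends at the inner ')'.
        System.out.println(firstGroup("see (the note (added later) below) please"));
        // prints: the note (added later
    }
}
```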

So no, you cannot do that in Java, or in any other programming language, without huge resources, NLP experience and training corpora to build from. The alternative is to significantly relax the requirement of detecting "whether a sequence of characters could be a stand-alone English sentence".

Upvotes: 2
