Reputation: 993
I'm trying to reduce some of the complexity of online text by removing everything except Latin characters and [!?., ]. Most of the characters can be removed without a problem, but for some of them I want specific rules:
A pair of ( and ), a pair of " (quotation marks), or a pair of * should convert the text inside them to a sentence if it contains more than two words. By converting to a sentence, I simply mean adding a full stop at the end. For example:
but *after* I came up with it, I searched and...
to
but after I came up with it, I searched and...
Here I simply want the * removed, as opposed to:
*buys airplane ticket* IM COMING FOR YOU
to
buys airplane ticket. IM COMING FOR YOU
In the first example, the author simply puts emphasis on a word that is part of the sentence. In the second example, the author describes an action that might as well be a sentence of its own. Quotation marks work similarly: a single quoted word is usually some sort of emphasis or sarcasm, while several quoted words are usually a quotation.
Is there a way to do this in regex (Java)?
EDIT:
So my general approach requires 2 patterns for each of the cases: the parentheses, the quotation marks, and the *. The first step is to handle the multi-word case by replacing \*((\w+ )+\w+)\* with $1. and the second step is to replace all remaining * with nothing. This works, but then I need 6 regex calls. Is there a better way?
Upvotes: 2
Views: 77
Reputation: 993
So my current best approach requires 2*numCases Patterns and looks like this:
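import java.util.regex.Pattern;

// Matches two or more space-separated words wrapped in a pair of asterisks.
static Pattern pattern = Pattern.compile("\\*((\\w+ )+\\w+)\\*");
// Matches any asterisk left over after the first pass (single-word emphasis).
static Pattern remove = Pattern.compile("\\*");

public static String transform(String str) {
    // Multi-word spans become sentences; the remaining asterisks are dropped.
    String sentences = pattern.matcher(str).replaceAll("$1.");
    return remove.matcher(sentences).replaceAll("");
}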
Running
System.out.println(transform("but *after* I came up with it, I searched and..."));
System.out.println(transform("*buys airplane ticket* IM COMING FOR YOU"));
gives the expected output:
but after I came up with it, I searched and...
buys airplane ticket. IM COMING FOR YOU
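One possible way to get from 2*numCases calls down to two calls in total is to fold the three delimiter cases into a single pattern: a backreference covers the symmetric delimiters (* and "), and an alternation covers the parentheses. This is only a sketch under the same assumptions as above (words are \w+ separated by single spaces); the class name DelimiterSentences is made up for illustration.

```java
import java.util.regex.Pattern;

class DelimiterSentences {
    // Alternative 1: * or " (group 1), two or more words (group 2), then the
    // same delimiter again via the backreference \1.
    // Alternative 2: two or more words (group 3) inside ( and ).
    private static final Pattern MULTI_WORD = Pattern.compile(
            "([*\"])((?:\\w+ )+\\w+)\\1|\\(((?:\\w+ )+\\w+)\\)");

    // Any delimiter character that survived the first pass.
    private static final Pattern LEFTOVER = Pattern.compile("[*\"()]");

    static String transform(String str) {
        // Only one of groups 2 and 3 takes part in a given match; Java
        // substitutes nothing for the group that did not participate.
        String sentences = MULTI_WORD.matcher(str).replaceAll("$2$3.");
        return LEFTOVER.matcher(sentences).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(transform("but *after* I came up with it, I searched and..."));
        System.out.println(transform("*buys airplane ticket* IM COMING FOR YOU"));
        System.out.println(transform("fine (I guess so) whatever"));
    }
}
```

The second pass removes every leftover delimiter, including unbalanced ones, which may or may not be what you want for stray parentheses.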
Upvotes: 0
Reputation: 17955
The standard Java library has no built-in notion of what a full English phrase looks like (telling white-space apart from letters or punctuation is about as far as it will help you). Additionally,
So no, you cannot do that with Java, or with any other programming language, without huge resources, NLP experience, and training corpora to build from, unless you significantly relax the requirement of detecting "whether a sequence of characters could be a stand-alone English sentence".
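To make "relaxing the requirement" concrete, here is a minimal sketch of one relaxed rule: treat a delimited span as a sentence purely by counting whitespace-separated words, with no attempt to judge whether it is real English. The class name RelaxedSentenceHeuristic, the minWords parameter, and the choice to skip the full stop when the span already ends in . ! or ? are all assumptions made for illustration. Only the * delimiter is handled, to keep it short.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RelaxedSentenceHeuristic {
    // Anything between a pair of asterisks (no nested asterisks).
    private static final Pattern SPAN = Pattern.compile("\\*([^*]+)\\*");

    static String transform(String text, int minWords) {
        Matcher m = SPAN.matcher(text);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            // Copy the text before the span, then the span without its asterisks.
            out.append(text, last, m.start());
            String inner = m.group(1).trim();
            out.append(inner);
            // Crude stand-in for "is this a sentence?": just count the words.
            int words = inner.isEmpty() ? 0 : inner.split("\\s+").length;
            boolean endsWithStop = !inner.isEmpty()
                    && ".!?".indexOf(inner.charAt(inner.length() - 1)) >= 0;
            if (words >= minWords && !endsWithStop) {
                out.append('.');
            }
            last = m.end();
        }
        out.append(text, last, text.length());
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(transform("*buys airplane ticket* IM COMING FOR YOU", 2));
        System.out.println(transform("but *after* I came up with it, I searched and...", 2));
    }
}
```

A word count will of course misfire on multi-word emphasis such as a title or a name, which is exactly the kind of case that needs real language understanding.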
Upvotes: 2