Reputation: 1
I have a small piece of code that takes an input string, cleans it up (removes special characters such as ' ’ and . and replaces any other special character with a space), and then generates a new string.
public class Example
{
    public static void main(String... args)
    {
        charFilter("I.T rocks. It's time to get a job.Come on");
    }

    public static String charFilter(String inText) {
        String outText = "";
        inText = inText.replaceAll("['’\\.]", "");
        outText = inText.replaceAll("[^a-zA-Z0-9- ]", " ");
        System.out.println(outText);
        return outText;
    }
}
The output of the above code is "IT rocks Its time to get a jobCome on", but I need "IT rocks Its time to get a job Come on": "job" and "Come" should appear as separate words, while "I.T" should still become "IT". This matters because users entering the data may forget to add a space after a full stop.
Can someone suggest what approach I should follow?
Upvotes: 0
Views: 102
Reputation: 81
You will need to use information about the semantics, which is why A.I. is more complicated than regex. Without additional information, a simple regex cannot distinguish between what humans consider an abbreviation and the end of one sentence or the start of the next.
One possible suggestion, though not a 100% solution, would be to look for single characters that are followed or separated by a dot and treat those dots as part of an abbreviation. I can imagine sentences that end in a single character with the next one starting with one, but it could still be a valid solution for many cases. Maybe you can come up with a similar workaround for other special characters, using some knowledge of the input language or subject domain (if any).
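A minimal sketch of that single-character heuristic, assuming an abbreviation is a lone letter directly followed by a dot. The class name `AbbreviationAwareFilter` and the final whitespace-collapsing step are my own additions for illustration, not part of the original code:

```java
class AbbreviationAwareFilter {

    public static String charFilter(String inText) {
        // Assumption: a dot right after a single letter that is not itself
        // preceded by a letter or digit belongs to an abbreviation, so drop it.
        String text = inText.replaceAll("(?<![A-Za-z0-9])([A-Za-z])\\.", "$1");
        // Remove straight and curly apostrophes entirely.
        text = text.replaceAll("['\u2019]", "");
        // Replace everything else that is not a letter, digit, hyphen or space
        // (including the remaining dots) with a space.
        text = text.replaceAll("[^a-zA-Z0-9 -]", " ");
        // Collapse the runs of whitespace created by the previous step.
        return text.replaceAll("\\s+", " ").trim();
    }

    public static void main(String... args) {
        // prints: IT rocks Its time to get a job Come on
        System.out.println(charFilter("I.T rocks. It's time to get a job.Come on"));
    }
}
```

As noted above, this is only a heuristic: a sentence that ends in a single letter, such as "Plan B.Next step", would still be glued together as "Plan BNext step".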
A completely generic solution would be to have a human re-read the text and correct the errors by hand. A regex or other automated substitution will not come close to 100% for all possible text input.
Upvotes: 1
Reputation: 29021
You're removing the . in the first regular expression, so by the time the second regex runs there is no dot left for it to replace with a space.
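To illustrate the ordering issue, here is a small sketch comparing the two variants. The class and method names (`DotOrderDemo`, `removeDotsFirst`, `keepDotsForSecondPass`) are made up for this example:

```java
class DotOrderDemo {

    // Same two steps as in the question: the dot is deleted in the first pass.
    static String removeDotsFirst(String s) {
        return s.replaceAll("['\u2019\\.]", "")
                .replaceAll("[^a-zA-Z0-9 -]", " ");
    }

    // The dot is left out of the first character class, so the second pass
    // still sees it and turns it into a space.
    static String keepDotsForSecondPass(String s) {
        return s.replaceAll("['\u2019]", "")
                .replaceAll("[^a-zA-Z0-9 -]", " ");
    }

    public static void main(String... args) {
        System.out.println(removeDotsFirst("job.Come"));        // prints: jobCome
        System.out.println(keepDotsForSecondPass("job.Come"));  // prints: job Come
        System.out.println(keepDotsForSecondPass("I.T rocks")); // prints: I T rocks
    }
}
```

Note that simply moving the dot to the second pass fixes "job.Come" but splits "I.T" into "I T", so the abbreviation case still needs separate handling.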
Upvotes: 1