dazzle

Reputation: 1

Help with regular expressions

I have a small piece of code that takes an input string, cleans it up (removes the special characters ', ’ and ., and replaces every other character that is not a letter, digit, hyphen or space with a space), and then generates a new string.

public class Example
{
    public static void main(String... args)
    {
        charFilter("I.T rocks. It's time to get a job.Come on");

    }

    public static String charFilter(String inText) { 

        String outText="";

        inText = inText.replaceAll("['’\\.]", "");
        outText = inText.replaceAll("[^a-zA-Z0-9- ]", " ");
        System.out.println(outText);
        return outText;
    }

}

The output of the above code is "IT rocks Its time to get a jobCome on". But I need the output to be "IT rocks Its time to get a job Come on" ("job" and "Come" should appear as separate words, but "I.T" should appear as "IT"), because users entering the data may forget to add a space after a full stop.

Can someone suggest what approach I should follow?

Upvotes: 0

Views: 102

Answers (2)

BertV

Reputation: 81

You will need information about the semantics, which is why A.I. is more complicated than regex. Without additional information, a simple regex cannot distinguish between what humans consider an abbreviation and the end or start of a sentence.

One possible suggestion, though not a 100% solution, would be to look for single characters followed or separated by a dot and treat those dots as part of an abbreviation. I can imagine sentences that end in a single character, with the next one starting with one, but it could still be a valid solution for many cases. Maybe you can come up with a similar workaround for other special characters, using some knowledge of the input language or subject domain (if any).
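The single-letter heuristic above could be sketched roughly as follows. This is only an illustration under stated assumptions, not a general solution: the class name `AbbreviationFilter`, the lookbehind pattern used to spot "a lone letter followed by a dot", and the final step that collapses repeated spaces are all additions that do not appear in the question.

```java
class AbbreviationFilter {

    // Hypothetical variant of the question's charFilter method.
    static String charFilter(String inText) {
        // 1. Delete apostrophes (straight and typographic) outright.
        String text = inText.replaceAll("['\u2019]", "");

        // 2. Assumption: a dot that follows a single letter which is not itself
        //    preceded by a letter or digit belongs to an abbreviation ("I.T"),
        //    so the dot is dropped and the letter is kept.
        text = text.replaceAll("(?<![A-Za-z0-9])([A-Za-z])\\.", "$1");

        // 3. Every remaining character that is not a letter, digit, space or
        //    hyphen (including the other dots) becomes a space.
        text = text.replaceAll("[^a-zA-Z0-9 -]", " ");

        // 4. Collapse the runs of spaces created in step 3 and trim the ends.
        return text.replaceAll(" +", " ").trim();
    }

    public static void main(String... args) {
        System.out.println(charFilter("I.T rocks. It's time to get a job.Come on"));
    }
}
```

For the question's input this prints `IT rocks Its time to get a job Come on`. As noted above, the heuristic misfires when a sentence really does end in a single letter: `Plan B.Then go` comes out as `Plan BThen go`.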

A completely generic solution would be to have a human re-read the text and correct the errors by hand. A regex or other automated substitution will not come close to 100% for all possible text input.

Upvotes: 1

Diego Sevilla

Reputation: 29021

You're removing the . in the first regular expression, so by the time the second regex runs there is no . left to be replaced by a space.
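To see the effect concretely, here is a minimal sketch that assumes the only change is dropping the dot from the first character class. The class name `DotOrderDemo` is hypothetical, and the typographic apostrophe is written as `\u2019` purely to keep the source ASCII-only.

```java
class DotOrderDemo {

    // Hypothetical variant of the question's charFilter with one change:
    // the dot is no longer listed in the first character class.
    static String charFilter(String inText) {
        // Only apostrophes are deleted here.
        String noApostrophes = inText.replaceAll("['\u2019]", "");
        // The dot now survives to this step and is replaced by a space,
        // like any other character outside the allowed set.
        return noApostrophes.replaceAll("[^a-zA-Z0-9 -]", " ");
    }

    public static void main(String... args) {
        System.out.println(charFilter("I.T rocks. It's time to get a job.Come on"));
    }
}
```

This prints `I T rocks  Its time to get a job Come on`: "job" and "Come" are now separate words, but "I.T" turns into "I T" and a double space appears after "rocks", so this change on its own does not produce the output the question asks for.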

Upvotes: 1
