VPeric

Reputation: 7461

Efficiently removing specific characters (some punctuation) from Strings in Java?

In Java, what is the most efficient way of removing given characters from a String? Currently, I have this code:

private static String processWord(String x) {
    String tmp;

    tmp = x.toLowerCase();
    tmp = tmp.replace(",", "");
    tmp = tmp.replace(".", "");
    tmp = tmp.replace(";", "");
    tmp = tmp.replace("!", "");
    tmp = tmp.replace("?", "");
    tmp = tmp.replace("(", "");
    tmp = tmp.replace(")", "");
    tmp = tmp.replace("{", "");
    tmp = tmp.replace("}", "");
    tmp = tmp.replace("[", "");
    tmp = tmp.replace("]", "");
    tmp = tmp.replace("<", "");
    tmp = tmp.replace(">", "");
    tmp = tmp.replace("%", "");

    return tmp;
}

Would it be faster if I used some sort of StringBuilder, or a regex, or maybe something else? Yes, I know: profile it and see, but I hope someone can provide an answer off the top of their head, as this is a common task.

Upvotes: 6

Views: 36191

Answers (8)

NDAYISABA Charles

Reputation: 124

inputString.replaceAll("[^a-zA-Z0-9]", "");

Upvotes: 0

Tomalak

Reputation: 338228

You could do something like this:

static String RemovePunct(String input) 
{
    char[] output = new char[input.length()];
    int i = 0;

    for (char ch : input.toCharArray())
    {
        if (Character.isLetterOrDigit(ch) || Character.isWhitespace(ch)) 
        {
            output[i++] = ch;
        }        
    }

    return new String(output, 0, i);
}

// ...

String s = RemovePunct("This is (a) test string.");

This will likely perform better than using regular expressions, if you find them too slow for your needs.

However, it could get messy fast if you have a long, distinct list of special characters you'd like to remove. In this case regular expressions are easier to handle.

http://ideone.com/mS8Irl

Upvotes: 5

Reimeus

Reputation: 159784

Although \\p{Punct} will specify a wider range of characters than in the question, it does allow for a shorter replacement expression:

tmp = tmp.replaceAll("\\p{Punct}+", "");
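To make "a wider range" concrete, here is a minimal sketch (the class and method names are made up for illustration). By default \p{Punct} covers all ASCII punctuation, so hyphens, apostrophes and underscores are removed along with the characters listed in the question:

```java
import java.util.regex.Pattern;

final class PunctDemo {
    // Compiled once and reused. \p{Punct} matches ASCII punctuation by default.
    private static final Pattern PUNCT = Pattern.compile("\\p{Punct}+");

    // Hypothetical helper: strips every punctuation character from s.
    static String stripPunct(String s) {
        return PUNCT.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        // The hyphen and the apostrophe are removed too, not just , ? ( % )
        System.out.println(stripPunct("well-known, isn't it? (50%)"));
    }
}
```

If words like "well-known" or "isn't" must survive intact, an explicit character class listing only the unwanted characters is probably the safer choice.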

Upvotes: 18

Ray Toal

Reputation: 88378

Here's a late answer, just for fun.

In cases like this, I would suggest aiming for readability over speed. Of course you can be super-readable but too slow, as in this super-concise version:

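private static String processWord(String x) {
    // Brackets are escaped: in Java an unescaped '[' inside a character class opens a nested class.
    return x.replaceAll("[\\[\\](){},.;!?<>%]", "");
}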

This is slow because every time you call this method, the regex will be compiled. So you can pre-compile the regex.

// requires: import java.util.regex.Pattern;
private static final Pattern UNDESIRABLES = Pattern.compile("[\\[\\](){},.;!?<>%]");

private static String processWord(String x) {
    return UNDESIRABLES.matcher(x).replaceAll("");
}

This should be fast enough for most purposes, assuming the JVM's regex engine optimizes the character class lookup. This is the solution I would use, personally.

Now without profiling, I wouldn't know whether you could do better by making your own character (actually codepoint) lookup table:

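// Size is an assumption: one entry per char value (use a larger table to cover all code points).
private static final boolean[] CHARS_TO_KEEP = new boolean[Character.MAX_VALUE + 1];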

Fill this once and then iterate, making your resulting string. I'll leave the code to you. :)
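For what it's worth, a minimal sketch of that lookup-table idea might look like the following. It assumes the table is indexed by UTF-16 char value (65,536 entries) rather than by full code point, and the class name is made up. Whether it actually beats the precompiled regex is something only a profiler can tell you.

```java
import java.util.Arrays;

final class LookupTableDemo {
    // Assumption: one flag per char value, not per code point.
    private static final boolean[] CHARS_TO_KEEP = new boolean[Character.MAX_VALUE + 1];

    static {
        // Keep everything by default, then mark the unwanted characters.
        Arrays.fill(CHARS_TO_KEEP, true);
        for (char c : ",.;!?(){}[]<>%".toCharArray()) {
            CHARS_TO_KEEP[c] = false;
        }
    }

    static String processWord(String x) {
        StringBuilder sb = new StringBuilder(x.length());
        for (int i = 0; i < x.length(); i++) {
            char c = x.charAt(i);
            if (CHARS_TO_KEEP[c]) {   // a single array read per character
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(processWord("f,i.l;t!e?r(e)d {s}t[r]i<n>g%"));
    }
}
```

The table costs 64 KB of memory, which is the price for checking each character with one array access instead of a chain of comparisons.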

Again, I wouldn't dive into this kind of optimization. The code has become too hard to read. Is performance that much of a concern? Also remember that modern languages are JITted and after warming up they will perform better, so use a good profiler.

One thing that should be mentioned is that the example in the original question is highly non-performant because you are creating a whole bunch of temporary strings! Unless a compiler optimizes all that away, that particular solution will perform the worst.

Upvotes: 12

Pshemo

Reputation: 124225

Right now your code iterates over all characters of tmp once for each character that you want to remove, so it performs roughly (number of characters in tmp) x (number of characters you want to remove) comparisons.

To optimize your code you could use short circuit OR || and do something like

StringBuilder sb = new StringBuilder();
for (char c : tmp.toCharArray()) {
    if (!(c == ',' || c == '.' || c == ';' || c == '!' || c == '?'
            || c == '(' || c == ')' || c == '{' || c == '}' || c == '['
            || c == ']' || c == '<' || c == '>' || c == '%'))
        sb.append(c);
}
tmp = sb.toString();

or like this

StringBuilder sb = new StringBuilder();
char[] badChars = ",.;!?(){}[]<>%".toCharArray();

outer: 
for (char strChar : tmp.toCharArray()) {
    for (char badChar : badChars) {
        if (badChar == strChar)
            continue outer;// we skip `strChar` since it is bad character
    }
    sb.append(strChar);
}
tmp = sb.toString();

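This way you still iterate over every character of tmp, but the number of comparisons per character can be smaller, because checking stops at the first match. A character is compared against the whole list only if it is % (the last one checked) or is not in the list at all; if the character is , (the first one checked), the program gets its result after a single comparison.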


If I am not mistaken, a regex character class ([...]) works in a similar way, so maybe try it this way

// store it somewhere so you won't need to compile it again
Pattern p = Pattern.compile("[,.;!?(){}\\[\\]<>%]");
tmp = p.matcher(tmp).replaceAll("");

Upvotes: 0

Ravi K Thapliyal

Reputation: 51711

Use String#replaceAll(String regex, String replacement) as

tmp = tmp.replaceAll("[,.;!?(){}\\[\\]<>%]", "");

System.out.println(
   "f,i.l;t!e?r(e)d {s}t[r]i<n>g%".replaceAll(
                   "[,.;!?(){}\\[\\]<>%]", "")); // prints "filtered string"

Upvotes: 0

jh314

Reputation: 27792

You can do this:

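tmp = tmp.replaceAll("\\W", "");

to remove punctuation. Note that the result has to be assigned back, since strings are immutable, and that \W matches every non-word character, so this also removes whitespace and leaves underscores in place.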
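A quick sketch of what \W actually removes (class and method names are made up for illustration). The second method is one possible variant that keeps whitespace, in case that matters for your input:

```java
final class NonWordDemo {
    // Removes everything that is not a letter, digit or underscore, spaces included.
    static String stripNonWord(String s) {
        return s.replaceAll("\\W", "");
    }

    // Hypothetical variant: removes non-word characters but keeps whitespace.
    static String stripNonWordKeepSpaces(String s) {
        return s.replaceAll("[^\\w\\s]", "");
    }

    public static void main(String[] args) {
        String tmp = "snake_case, two words!";
        System.out.println(stripNonWord(tmp));
        System.out.println(stripNonWordKeepSpaces(tmp));
    }
}
```

Both calls compile the regex on every invocation, so for a hot path a precompiled Pattern would be the more sensible form.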

Upvotes: -1

dghalbr

Reputation: 109

Strings are immutable, so every modification creates a new String; it's not a good idea to modify them repeatedly. Try using StringBuilder instead of String and take advantage of its methods, which let you build the result in place. Also, if there is a specific set of characters you're trying to remove, figure out the regex for it; that will likely work better for you than a chain of replace calls.

Upvotes: 1
