iclinux
iclinux

Reputation: 532

regex is very slow, how to check if a string is only with word chars fast?

I have a function to check if a string(most of the string is only with one CJK char) is only with word chars, and it will be invoked many many times, so the cost is unacceptable, but I don't know how to optimize it, any suggestions?

/*\w is equivalent to the character class [\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}].
 For more details see Unicode TR-18, and bear in mind that the set of characters
 in each class can vary between Unicode releases.*/
private static final Pattern sOnlyWordChars = Pattern.compile("\\w+");

private boolean isOnlyWordChars(String s) {
    return sOnlyWordChars.matcher(s).matches();
}

when s is "3g", or "go_url", or "hao123", isOnlyWordChars(s) should return true.

Upvotes: 3

Views: 398

Answers (4)

stinepike
stinepike

Reputation: 54672

private boolean isOnlyWordChars(String s) {
    char[] chars = s.toCharArray();    
    for (char c : chars) {
        if(!Character.isLetter(c)) {
            return false;
        }
    }    
    return true;
}

A better implementation

public static boolean isAlpha(String str) {
    if (str == null) {
        return false;
    }
    int sz = str.length();
    for (int i = 0; i < sz; i++) {
        if (Character.isLetter(str.charAt(i)) == false) {
            return false;
        }
    }
    return true;
}

Or if you are using Apache Commons, StringUtils.isAlpha(). the second implemenation of the answer is actually from the source code if isAlpha.

UPDATE

HI Sorry for the late reply. I wasn't pretty sure about the speed although I read in several places that loop is faster than regex. To be sure I run the following codes in ideoone and here is the result

for 5000000 iteration

with your codes: 4.99 seconds (runtime error after that so for big data it is not working)

with my first code 2.71 seconds

with my second code 2.52 seconds

for 500000 iteration

with your codes: 1.07 seconds

with my first code 0.36 seconds

with my second code 0.33 seconds

Here is the sample code I used.

N.B. There might be small mistakes. You can play with it to test in different scenario. according to the comment of Jan, I think those are minor things like using private or public. yest condition checking is a nice point.

Upvotes: 4

Stephen C
Stephen C

Reputation: 718718

If you want to do this using regexes, then the most efficient way do it is to change the logic to a negation; i.e. "every character is a letter" becomes "no character is a non-letter".

private static final Pattern pat = Pattern.compile("\\W");

private boolean isOnlyWordChars(String s) {
    return !pat.matcher(s).find();
}

This will test each character at most once ... with no backtracking.

Upvotes: 1

Casimir et Hippolyte
Casimir et Hippolyte

Reputation: 89547

The only thing i see is to change your pattern to:

^\\w++$

but i am not a java expert

explanations:

I have added anchors (ie ^ $) that increases the performances of the pattern (the regex engine fails at the first non word character until it encounters the end). I have added a possessive quantifier (ie ++), then the regex engine doesn't matter of backtrack positions and is more fast.

more informations here.

Upvotes: 1

Makoto
Makoto

Reputation: 106389

I think that the chief problem is your pattern.

I was working through an iterative solution, when I noticed that it failed on one of my test strings Supercalifragilisticexpalidociou5. This reason for this: \w+ only cares if there is one or more word characters. It doesn't care if you're not looking at a word character beyond what it's already matched.

To rectify this, use a lookaround:

(?!\W+)(\w+)

The \W+ condition will lock the regex if one or more characters are found to be a non-word character (such as &*()!@!#$).

Upvotes: 1

Related Questions