Reputation: 139

Time Complexity Using Regex

I was Doing A question where i had to calculate the no of words in a string as a part of the problem.

eg: "hi i am a programmer" this should return 5. now i thought of two methods to do this:

Using split

String[] words=messages[i].split("\\s");
int length=words.length;

Using while loop:

int getwordCount(String message)
{
        int result = 1;

        //for(int i=0;i<message.length();i++){
        //  char ch=message.charAt(i);

        for (char ch : message.toCharArray())
        {
            if (ch == ' ') 
                ++result;
        }
        return result; 
}

in some cases the 2nd method was proving more efficient and i was getting better time result what method is better to use and why since the running TC for .split() is O(n) which will be similar to the TC of 2nd method which would be O(n) . Even if i do not discard the use of .toCharArray() which is O(n) the method still gives better result.

The only explanation i can think of was of using the regex \\s. what exactly is going on?

Upvotes: 0

Answers (3)

Luatic

Reputation: 11201

First of all your two methods are both wrong. You're counting the number of space-separated "things" (needn't be words) in the first method rather than counting the number of non-space sequences:

jshell> ("some  words").split(" ")
$2 ==> String[3] { "some", "", "words" }

jshell> (" leading and trailing spacing ").split(" ")
$3 ==> String[5] { "", "leading", "and", "trailing", "spacing" }

jshell> ("").split(" ")
$4 ==> String[1] { "" }

jshell> (" ").split(" ")
$5 ==> String[0] {  }

as you can see the word count does not match the space-delimited content count. Counting the spaces suffers from the same issues, except it will also fail on strings with "only" trailing spacing.

Both the RegEx and the for-loop run in linear time O(n), having to visit each character exactly once; using a RegEx requires first compiling the RegEx, but for such a simple RegEx we can reasonably neglect this. This does not mean that they take the same time to complete though, the constant factor may very well differ. As others have pointed out already the RegEx + splitting obviously incurs a significant overhead - especially as this requires auxiliary space O(n) whereas the simple for loop counting variant requires just constant auxiliary space O(1) to keep track of the count and the current index.

Fixing & improving your code using RegEx is quite easy:

int count = 0;
Matcher matcher = Pattern.compile("\\S+").matcher(message);
while (matcher.find()) count++;

You'll likely want to use a character class like \w (alphanumerics) for words rather than looking for non-space characters \S.

Or if you want to implement this manually using the for-loop as it may be slightly faster:

int count = message.charAt(0) == ' ' ? 0 : 1; // does the text start with a non-space character?
for (int i = 1; i < message.length; i++) {
    // beginning of word: transition space -> non-space
    if (message.charAt(i) != ' ' && message.charAt(i-1) == ' ')
        count++
}

Upvotes: 0

Ralan

Reputation: 804

I'm by no means a code optimization expert but i would asume the second method to be faster, especially in large text block since the first needs to create N number of string in memory where the second works of the char array of the first. If you where to use a regex i think the result would be faster if you use a pattern matcher for the \\s pattern and count the matches in a while loop

Pattern pattern = Pattern.compile("\\s");
Matcher matcher = pattern.matcher(yourString);
int count = 0;
while (matcher.find()) {
    count++;
}

Upvotes: 1

Tim Biegeleisen

Reputation: 522471

When you split the array on whitespaces using regex, the regex engine will have to walk down the string once, and make the splits. This option also requires allocating a new String array and populating it with the individual words. The array step increases the running time as well as the storage requirements.

Your second version, while certainly more verbose, also only requires a single walk down the String. However, the second loop version totally avoids the allocation of a String array and the time/space required to populate it. Therefore, I would expect the second version to outperform the first.

That being said, we don't necessarily use regex because of its performance, but rather the simplicity of code it offers. I would probably always use the first string split version in a production code base, unless super high performance were absolutely required (e.g. in an Android app).

Upvotes: 1

Time Complexity Using Regex

Answers (3)

Related Questions