coder_learner

Reputation: 95

Counting the occurrences of string in Java using string.split()

I'm new to Java. I thought I would write a program to count the occurrences of a character or a sequence of characters in a sentence. I wrote the following code. But I then saw there are some ready-made options available in Apache Commons.

Anyway, can you look at my code and tell me if there are any rookie mistakes? I tested it for a couple of cases and it worked fine. I can think of one case that might be a problem: if the input is a big text file instead of a small sentence/paragraph, split() has to handle a very large string and array. However, this is only my guess and I would love to have your opinions.

private static void countCharInString() {
    //Get the sentence and the search keyword
    System.out.println("Enter a sentence\n");
    Scanner in = new Scanner(System.in);
    String inputSentence = in.nextLine();
    System.out.println("\nEnter the character to search for\n");
    String checkChar = in.nextLine();
    in.close();

    //Count the number of occurrences
    String[] splitSentence = inputSentence.split(checkChar);
    int countChar = splitSentence.length - 1;
    System.out.println("\nThe character/sequence of characters '" + checkChar + "' appear(s) '" + countChar + "' time(s).");
}

Thank you :)

Upvotes: 1

Views: 2269

Answers (3)

Erwin Bolwidt

Reputation: 31279

split is the wrong approach for a number of reasons:

  • String.split takes a regular expression
    • Regular expressions have characters with special meanings, so you cannot use it for all characters (without escaping them). This requires an escaping function.
    • Performance: String.split is optimized for single characters. If this were not the case, you would be creating and compiling a regular expression every time. Still, String.split creates one object for the String[] and one object for each String in it, every time that you call it. And you have no use for these objects; all you want to know is the count. Although a future all-knowing HotSpot compiler might be able to optimize that away, the current one does not: it is roughly 10 times as slow as simply counting characters as below.
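    • It will not count correctly in every case: split discards trailing empty strings, so occurrences of your checkChar at the end of the input (including repeated ones) are not counted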
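
To illustrate the escaping point, here is a minimal sketch. The class and method names are made up for this example; naiveCount mirrors the question's approach and quotedCount wraps the search text in Pattern.quote so it is treated as a literal.

```java
import java.util.regex.Pattern;

class SplitEscapingDemo {
    // The question's approach: the search text is used as a regex as-is.
    static int naiveCount(String input, String check) {
        return input.split(check).length - 1;
    }

    // Same approach, but Pattern.quote makes the search text a literal.
    static int quotedCount(String input, String check) {
        return input.split(Pattern.quote(check)).length - 1;
    }

    public static void main(String[] args) {
        // "." as a regex matches every character, so every piece is empty
        // and split returns an empty array.
        System.out.println(naiveCount("a.b.c", "."));  // prints -1
        System.out.println(quotedCount("a.b.c", ".")); // prints 2
    }
}
```

Quoting only addresses the special-character problem; the quoted version still allocates the array and still drops trailing empty strings.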

A better approach is much simpler: just go and count the characters in the string that match your checkChar. If you think about the steps you need to take to count characters, that's what you'd end up with by yourself:

public static int occurrences(String str, char checkChar) {
    int count = 0;
    for (int i = 0, l = str.length(); i < l; i++) {
        if (str.charAt(i) == checkChar)
            count++;
    }
    return count;
}

If you want to count the occurrences of a sequence of multiple characters, it becomes slightly trickier to write with some efficiency because you don't want to create a new substring every time.

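public static int occurrences(String str, String checkChars) {
    // An empty search string matches at every offset and would never end the loop
    if (checkChars.isEmpty()) {
        return 0;
    }
    int count = 0;
    int offset = 0;
    // Jump from match to match; indexOf returns -1 when there are no more
    while ((offset = str.indexOf(checkChars, offset)) != -1) {
        offset += checkChars.length();
        count++;
    }
    return count;
}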

That's still 10-12 times faster than String.split() when matching a two-character string.

Warning: Performance timings are ballpark figures that depend on many circumstances. Since the difference is an order of magnitude, it's safe to say that String.split is slower in general. (Tests performed on jdk 1.8.0-b28 64-bit, using 10 million iterations; I verified that the results were stable and the same with and without -Xcomp, after performing the tests 10 times in the same JVM instance.)
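
One thing to be aware of with the indexOf loop above: because the offset skips past each match, it counts non-overlapping occurrences. Whether that is what you want depends on your use case. Here is a small sketch comparing the two behaviours; the class and method names are mine and purely illustrative.

```java
class OverlapCountDemo {
    // Non-overlapping: skip past the whole match before searching again.
    static int countNonOverlapping(String str, String sub) {
        if (sub.isEmpty()) return 0; // an empty needle would never end the loop
        int count = 0;
        int offset = 0;
        while ((offset = str.indexOf(sub, offset)) != -1) {
            offset += sub.length();
            count++;
        }
        return count;
    }

    // Overlapping: advance by a single character after each match.
    static int countOverlapping(String str, String sub) {
        if (sub.isEmpty()) return 0;
        int count = 0;
        int offset = 0;
        while ((offset = str.indexOf(sub, offset)) != -1) {
            offset++;
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countNonOverlapping("aaaa", "aa")); // prints 2
        System.out.println(countOverlapping("aaaa", "aa"));    // prints 3
    }
}
```

For single characters the two give the same result, so the difference only matters for longer search strings that can overlap with themselves.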

Upvotes: 0

spinlok

Reputation: 3661

One flaw I can think of immediately: if your inputSentence consists of nothing but a single occurrence of checkChar, split() will return an empty array and your count will be -1 instead of 1.

An example interaction:

Enter a sentence

onlyme

Enter the character to search for

onlyme

The character/sequence of characters 'onlyme' appear(s) '-1' time(s).
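
To see where the -1 comes from, it can help to look at what split() actually returns. This is a minimal sketch with made-up class and method names; it compares the default call with the two-argument overload, where a negative limit keeps trailing empty strings.

```java
import java.util.Arrays;

class TrailingEmptyDemo {
    // Number of pieces with the default behaviour (trailing empty strings dropped).
    static int piecesDefault(String input, String regex) {
        return input.split(regex).length;
    }

    // Number of pieces when a negative limit keeps trailing empty strings.
    static int piecesKeepTrailing(String input, String regex) {
        return input.split(regex, -1).length;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString("onlyme".split("onlyme")));     // prints []
        System.out.println(Arrays.toString("onlyme".split("onlyme", -1))); // prints [, ]
        System.out.println(piecesDefault("onlyme", "onlyme") - 1);         // prints -1
        System.out.println(piecesKeepTrailing("onlyme", "onlyme") - 1);    // prints 1
    }
}
```

A negative limit would repair the arithmetic in this particular case, but the search text is still interpreted as a regular expression.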

A better way would be to use the .indexOf() method of String to count the occurrences like this:

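int count = 0;
int i = 0; // position to resume searching from
// indexOf returns -1 once there are no further matches
while ((i = inputSentence.indexOf(checkChar, i)) != -1) {
    count++;
    i = i + checkChar.length();
}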

Upvotes: 1

Bohemian

Reputation: 425043

Because of edge cases, split() is the wrong approach.

Instead, use replaceAll() to remove all other characters then use the length() of what's left to calculate the count:

int count = input.replaceAll(".*?(" + check + "|$)", "$1").length() / check.length();

FYI, the regex created (for example when check = 'xyz') looks like ".*?(xyz|$)", which means "everything up to and including 'xyz' or end of input", and each match is replaced by the captured text (either 'xyz' or nothing if it's end of input). This leaves just a string of 0-n copies of the check string. Dividing its length by the length of check then gives you the total.

To protect against the check being null or zero-length (causing a divide-by-zero error), code defensively like this:

int count = check == null || check.isEmpty() ? 0 : input.replaceAll(".*?(" + check + "|$)", "$1").length() / check.length();
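
Note that the expression above still treats check as a regular expression, and that '.' and '$' behave specially around line terminators. Here is a hedged variant, assuming you want check matched literally and that input may span several lines: it quotes check with Pattern.quote, enables DOTALL with (?s) and anchors on \z (the very end of input) instead of $. The class and method names are only for illustration.

```java
import java.util.regex.Pattern;

class ReplaceAllCountDemo {
    static int count(String input, String check) {
        // Guard against null input and the divide-by-zero case.
        if (input == null || check == null || check.isEmpty()) {
            return 0;
        }
        // (?s) lets '.' match line terminators as well;
        // Pattern.quote turns check into a literal;
        // \z matches only at the very end of the input.
        String regex = "(?s).*?(" + Pattern.quote(check) + "|\\z)";
        // What remains is zero or more copies of check.
        return input.replaceAll(regex, "$1").length() / check.length();
    }

    public static void main(String[] args) {
        System.out.println(count("a.b.c", "."));       // prints 2
        System.out.println(count("onlyme", "onlyme")); // prints 1
        System.out.println(count("x\nx\n", "x"));      // prints 2
    }
}
```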

Upvotes: 1
