Kol256
Kol256

Reputation: 85

Is there a way to count character frequency with an ArrayList similar to a regular array?

Usually when you want to get the frequency of a character you can do something like this:

int[] count = new int[256];
for(int i: arr) count[arr.charAt(i)]++;
// Example if you have 2 'a' characters count[97] (97 being the ascii value of 'a') will return 2

Is there a way to do this with an ArrayList instead?

Upvotes: 1

Views: 528

Answers (1)

Basil Bourque
Basil Bourque

Reputation: 339193

tl;dr

Let's go nuts with the following code, or read on further below for simpler code.

"👋Hello"
        .codePoints()       // Returns `IntStream` of the code point for each character in the string.
        .boxed()            // Converts `int` primitives to `Integer` objects.
        .collect(
                Collectors.groupingBy(
                        Function.identity() ,   // Classification function.
                        TreeMap :: new ,        // Map factory.
                        Collectors.counting()   // Downstream collector.
                )
        )
        .forEach(
                ( codePoint , count ) ->
                        System.out.println( "Code point: " + codePoint + " | Character: " + Character.toString( codePoint ) + " | Count: " + count )
        );
Code point: 72 | Character: H | Count: 1
Code point: 101 | Character: e | Count: 1
Code point: 108 | Character: l | Count: 2
Code point: 111 | Character: o | Count: 1
Code point: 128075 | Character: 👋 | Count: 1

List

As commented, you could use a List implementation such as ArrayList. But likely no benefit. The code would be more complicated, would use more memory, and would likely be less performant. You would have to use Integer objects rather than int primitives.

By the way, you should not be using char. That type has been legacy since Java 2. As a 16-bit value, char is physically incapable of representing most characters.

And your limit of 256 is far too small. Java supports all of the over 140,000 characters defined in Unicode. Those characters are assigned to code point integer numbers ranging over a million. Use the constant Character.MAX_CODE_POINT as your limit.

List< Integer > counts = new ArrayList<>( Character.MAX_CODE_POINT ) ;   

Populate with a zero for each element. A list is initialized with null in each element, unlike an array of int being initialized to zero in each element.

for( int i = 0 ; i <= Character.MAX_CODE_POINT ; i ++ ) 
{
    counts.add( 0 ) ;  
}

Process your input.

String input = "👋Hello" ; 
int[] codePoints = input.codePoints().toArray() ;

for( int i = 0 ; i < codePoints.length ; i ++ ) 
{
    int codePoint = codePoints[ i ] ;
    int count = counts.get( codePoint ) ;
    counts.set( codePoint , count + 1 ) ;
}

We could report the results by dumping counts list to the console. But with over a million elements, that would be cumbersome. Instead, let's filter out all the elements where the count is zero.

for ( int index = 0 ; index < counts.size() ; index++ )
{
    if ( counts.get( index ) != 0 )
    {
        System.out.println(
                index + " ➣ " + counts.get( index )
        );
    }
}

Or, do the same using streams and lambda syntax. Save effect.

IntStream.range( 0 , counts.size() )
        .filter( index -> counts.get( index ) != 0 )
        .mapToObj( index -> index + " ➣ " + counts.get( index ) )
        .forEach( System.out :: println );

When run:

72 ➣ 1
101 ➣ 1
108 ➣ 2
111 ➣ 1
128075 ➣ 1

Map

If using objects and collections, using a Map instead of a List would make much more sense. You would have only as many entries in the map as you have distinct letters in your input, rather than over a million in your list.

A map tracks pairs of key and value. In our problem at hand, we would use the code point number as our key, and the count would be our value.

And if we would want to keep our keys, our code points, in order we would use a NavigableMap.

String input = "👋Hello";

NavigableMap < Integer, Long > codePointFrequency =
        input
                .codePoints()                       // Returns `IntStream` of the code point for each character in the string.
                .boxed()                            // Converts `int` primitives to `Integer` objects.
                .collect(                           
                        Collectors.groupingBy( 
                            Function.identity() ,   // Classification function.
                            TreeMap :: new ,        // Map factory.
                            Collectors.counting()   // Downstream collector. 
                        )
                );

Dump the map to console.

System.out.println( "codePointFrequency = " + codePointFrequency );

When run.

codePointFrequency = {72=1, 101=1, 108=2, 111=1, 128075=1}

Report each character.

codePointFrequency.forEach(
        ( codePoint , count ) ->
                System.out.println( "Code point: " + codePoint + " | Character: " + Character.toString( codePoint ) + " | Count: " + count )
);

When run.

Code point: 72 | Character: H | Count: 1
Code point: 101 | Character: e | Count: 1
Code point: 108 | Character: l | Count: 2
Code point: 111 | Character: o | Count: 1
Code point: 128075 | Character: 👋 | Count: 1

Upvotes: 2

Related Questions