Sputn1k
Sputn1k

Reputation: 51

How to calculate the frequency of a character in a string in C

I have found code that calculate the frequency of a character in a string, however, all of them are using the same line of code and don't explain what it means. Can someone please enlighten me?

Here is an example:

int c = 0, count[26] = {0}, x;

while (string[c] != '\0') {
/** Considering characters from 'a' to 'z' only and ignoring others. */

  if (string[c] >= 'a' && string[c] <= 'z') {
     x = string[c] - 'a';
     count[x]++;
  }

I understand that the loop will iterate through the string until it reaches the end. I also get the if statement, as that only limits it between a and z. However, I have no idea what x = string[c] -'a' is doing, why is it subtracting 'a'? I also don't understand what the purpose of count[26] is.

Here is where I got this program from:

https://www.programmingsimplified.com/c-program-find-characters-frequency

Any help would be great thanks.

Upvotes: 1

Views: 1002

Answers (5)

selbie
selbie

Reputation: 104559

count[26] is the frequency table. count[0] is the number of occurrences of a. count[1] is the number of occurrences of b, etc...

Initialize the count array to all zero values

count[26] = {0}

While not at the end of the string. Remember, C strings always end with a null char (\0).

while (string[c] != '\0') {     

Evaluate if the character at string[c] is between a and z

  if (string[c] >= 'a' && string[c] <= 'z') {

Normalize the ascii value of this character (which will be between 97 and 122) to a value from 0 to 25. a is 97 when evaluated in a math expression.

     x = string[c] - 'a';

Using the x value computed above, use that as an index into the count table Increment whatever value that is in count[x] by 1.

     count[x]++;

What's missing from this code sample is the place where c gets incremented by 1 such that string[c] is referencing the next character in the string.

Upvotes: 1

Andrew Henle
Andrew Henle

Reputation: 1

First, that code depends on the characters 'a' through 'z' being represented consecutively, something that's common but not guaranteed.

Second,

int c = 0, count[26] = {0}, x;

while (string[c] != '\0') {
/** Considering characters from 'a' to 'z' only and ignoring others. */

  if (string[c] >= 'a' && string[c] <= 'z') {
     x = string[c] - 'a';
     count[x]++;
  }

has several issues and I'd say is more clear as

#include <ctype.h>

// multiple variables on one line is ***very*** bug prone
// so don't do it
int c = 0;
int count[26] = {0};

// set the input - unsigned is important
unsigned char *string = ...;

// loop until the character pointed at by string is '\0'
while ( *string )
{
  // islower() returns non-zero only if the unsigned char value
  // passed is a lower-case letter.
  //
  // If the input int value  can't be represented as an unsigned
  // char the results are undefined behavior (because tolower() 
  // actually takes an int argument.)
  //
  // signed char will be sign-extended when passed as an int argument
  if ( islower( *string ) )
  {
     // get the "index" of the lower-case letter
     // a -> 0, b -> 1, z -> 25
     // depends on a-z being consecutive - not necessarily true
     int x = *string - 'a';

     // increment the number of times this lower-case character
     // is in this string
     count[x]++;
  }

  // go to the next character in the string
  string++;
}

Note that there is zero effort on my part to reduce the number of lines used. You gain nothing by cramming code into fewer lines, but you will make the code harder to read and therefore more bug prone.

A better way to count characters in a string:

#include <limits.h>

void countChars( unsigned char *string )
{
    int counts[ UCHAR_MAX ] = { 0 };

    while ( *string )
    {
        counts[ *string ]++;
        string++;
    }
}

If you want to count lower-case characters:

#include <limits.h>

void countLowerCaseChars( unsigned char *string )
{
    int counts[ UCHAR_MAX ] = { 0 };

    while ( *string )
    {
        counts[ tolower( *string ) ]++;
        string++;
    }
}

Upvotes: 0

dreamcrash
dreamcrash

Reputation: 51463

TL;DR Is taken advantage of the ASCII table.

The code only accepts characters from a to z:

if (string[c] >= 'a' && string[c] <= 'z') 

therefore it creates an array with 26 positions (i.e., count[26]) to store the frequency of those same characters. The following

   x = string[c] - 'a';

converts string[c] into an int; fact which can be used to take advantage of the ASCII table.

enter image description here

According to the ASCII table the letters 'a' to 'z' are represented by the int values from 97 to 112, respectively. Therefore, because arrays in C start with 0 we need to shift 97 elements to the left from the value that will be return by string[c], namely:

 x = string[c] - 97;

which can be represented by

 x = string[c] - 'a';

With this trick if string[c] is 'a' then :

 x = 'a' - 'a';

which is converted to x = 97 - 97, then x = 0; Therefore,

count[x]++; is count[0]++;

which increments by 1 the position 0 of the array count, which is "reserved" to the letter 'a'. This same logic applies to all the other letters from 'a' to 'z'.

Bear in mind, however, and quoting Eric Postpischil:

The character codes used in C implementations do not necessarily have all the letters consecutively. ASCII does and is very common, but not required by the standard.

Hence, this solution will work if your encoding is ASCII.

Upvotes: 2

ulix
ulix

Reputation: 373

count[26] is an array of 26 integers, each one represent the count of a lowercase letter from 'a' to 'z' from within string

count[0] is the counter for 'a', count[1] is the counter for 'b' etc...

x = string[c] - 'a' calculates and assign to x the index 0-25 for the char found at string[c]. Simplifyng: remember that 'a' is the integer ascii value 97 decimal. The subtraction of 'a' is to reduce all indexes from 'a' to 'z' to a value from 0 to 25 needed for the count[] array.

Upvotes: 0

Rachid K.
Rachid K.

Reputation: 5211

In the ASCII data base, 'a' to 'z' have consecutive numerical code from 0x61 to 0x7A. cf. man ascii. Hence, if you substract the value of 'a', you get numbers from 0 to 25. This number is an index in count[] table.

Upvotes: 0

Related Questions