sp497
sp497

Reputation: 2491

Count occurrences of each letter in a file?

How to find the occurrence of letters A-Z regardless(ignore case) in a optimized way even if the file size is as large as 4GB or more ? What could be the different implementations possible in C++/C ?

One implementation is :

Pseudocode

A[26]={0}
loop through each character ch in file
If isalpha(ch)
     A[tolower(ch)-'A']+ = 1
End If
end loop

Upvotes: 2

Views: 1597

Answers (2)

leonbloy
leonbloy

Reputation: 75986

Not much optimization left, I think.

  • Instead of computing tolower()-'A' for each element, just count occurrences of each character (in a char[256] accumulator), and do the case-aware computation afterwards (Might be more efficient or not, just try).

  • Be sure to use buffered input (fopen, perhaps assign larger buffer with setvbuf).

Eg:

acum[256]={0}
loop through each character 'c' in file
     acum[c]++
end loop
group counts corresponding to same lowercase/uppercase letters

Also, bear in mind that this assumes ASCII or derived (one octet = one character) encoding.

Upvotes: 9

Jonathan Wood
Jonathan Wood

Reputation: 67283

This is not going to be instantaneous with 4GB. I see know way to do what you are doing much faster.

In addition, your code wouldn't handle tabs, spaces or other characters. You need to use isalpha() and only increment the count if it returns true.

Note that isalpha() is extremely fast. But, again, this code would not be instantaneous with a very large input.

TCHAR a[26] = { 0 };

for (int i = 0; i < length; i++)
{
    if (isalpha(text[i]))
    {
        a[tolower(text[i]) - 'a']++;
    }
}

Upvotes: 1

Related Questions