罗泽轩

Reputation: 1673

How to count the occurrences of each Unicode character in a text with C++

I have written a simple program to count how many times each character occurs in a text. Here is the code:

#include <iostream>
#include <fstream>
#include <map>
using namespace std;
const char* filename="text.txt";
int main()
{
    map<char,int> dict;
    fstream f(filename);
    char ch;
    while (f.get(ch))
    {
        if(!f.eof())
            cout<<ch;
        if (!dict[ch])
            dict[ch]=0;
        dict[ch]++;
    }
    f.close();
    cout<<endl;
    for (auto it=dict.begin();it!=dict.end();it++)
    {
        cout<<(*it).first<<":\t"<<(*it).second<<endl;
    }
    system("pause");
}

The program works well for ASCII characters, but it does not work for Unicode characters such as Chinese characters. How can I make it work with Unicode characters?

Upvotes: 1

Views: 1536

Answers (4)

Joe

Reputation: 6777

First off, what do you want to count: Unicode code points or grapheme clusters, i.e., characters in the encoding sense or characters as perceived by the reader? Also keep in mind that "wide characters" (16-bit characters) are not Unicode characters (UTF-16 is variable-length, just like UTF-8!).

In any case, get a library such as ICU to do the actual code point/cluster iteration. For counting, you need to replace the char type in your map with an appropriate type: either a 32-bit unsigned int for code points, or normalized strings for grapheme clusters. Normalization should, again, be taken care of by a library.

ICU: http://icu-project.org

Grapheme clusters: http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries

Normalization: http://unicode.org/reports/tr15/
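To make the code point option concrete, here is a minimal sketch in plain standard C++ (no ICU) that decodes UTF-8 into 32-bit code points and counts them. The name countCodePoints is hypothetical, and counting malformed bytes as U+FFFD is an assumption of this sketch, not something the answer prescribes. It does no normalization and knows nothing about grapheme clusters, which is where a library such as ICU comes in.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Hypothetical helper: count each code point in a UTF-8 encoded string.
// Assumption: malformed or truncated sequences are counted as U+FFFD.
std::map<char32_t, int> countCodePoints(const std::string& utf8)
{
    std::map<char32_t, int> counts;
    std::size_t i = 0;
    while (i < utf8.size()) {
        unsigned char lead = static_cast<unsigned char>(utf8[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else {
            // Stray continuation byte or invalid lead byte.
            ++counts[0xFFFD];
            ++i;
            continue;
        }
        ++i;
        // Fold in the continuation bytes (each carries 6 payload bits).
        while (extra > 0 && i < utf8.size()) {
            unsigned char c = static_cast<unsigned char>(utf8[i]);
            if ((c & 0xC0) != 0x80) break;  // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
            ++i;
            --extra;
        }
        if (extra != 0) cp = 0xFFFD;  // sequence was cut short
        ++counts[cp];
    }
    return counts;
}
```

Note that "e" followed by U+0301 (combining acute accent) is counted as two code points here, although a reader perceives one character. That is the code point versus grapheme cluster distinction made above.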

Upvotes: 2

James Kanze

Reputation: 154037

If you can compromise and just count code points, it's fairly simple to do directly in UTF-8. Your dictionary, however, will have to be std::map<std::string, int>. Once you've got the first byte of a UTF-8 sequence, it tells you how many bytes the code point occupies:

while ( f.get( ch ) ) {
    static size_t const charLen[] = 
    {
          1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
          1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
          1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
          1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
          1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
          1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
          1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
          1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
          0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
          0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
          0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
          0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
          2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,
          2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,
          3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,
          4,  4,  4,  4,  4,  4,  4,  4,  5,  5,  5,  5,  6,  6,  0,  0,
    } ;
    int chLen = static_cast<int>( charLen[ static_cast<unsigned char>( ch ) ] );
    if ( chLen <= 0 ) {
        //  error: impossible first character for UTF-8
        continue;   //  skip this byte and resynchronize on the next one
    }
    std::string codepoint( 1, ch );
    -- chLen;
    while ( chLen != 0 ) {
        if ( !f.get( ch ) ) {
            //  error: file ends in middle of a UTF-8 code point.
        } else if ( (ch & 0xC0) != 0x80 ) {
            //  error: illegal following character in UTF-8
        } else {
            codepoint += ch;
        }
        -- chLen;   //  one following byte handled; without this the loop never ends
    }
    ++ dict[codepoint];
}

You'll note that most of the code is involved in error handling.
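For comparison, here is a compact, self-contained sketch of the same idea with the error paths filled in. It is one possible way to do it, not the answer's own code: the names Utf8Counts, sequenceLength and countUtf8 are hypothetical, the lead-byte table is replaced by bit tests, and lengths are capped at 4 bytes as in current UTF-8 (the 5 and 6 byte forms in the table come from the older definition). Counting bad input in a separate malformed counter is an assumption of this sketch.

```cpp
#include <cstddef>
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Hypothetical result type: per-code-point counts plus a tally of bad input.
struct Utf8Counts {
    std::map<std::string, int> perCodePoint;  // key: the UTF-8 bytes of one code point
    int malformed = 0;                        // invalid lead bytes and cut-off sequences
};

// Length of a UTF-8 sequence given its first byte, or 0 if the byte
// cannot start a sequence.
std::size_t sequenceLength(unsigned char lead)
{
    if (lead < 0x80)           return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    if ((lead & 0xF8) == 0xF0) return 4;
    return 0;
}

Utf8Counts countUtf8(std::istream& in)
{
    Utf8Counts result;
    char ch;
    while (in.get(ch)) {
        std::size_t len = sequenceLength(static_cast<unsigned char>(ch));
        if (len == 0) {
            ++result.malformed;
            continue;
        }
        std::string codepoint(1, ch);
        while (codepoint.size() < len) {
            // Peek first so a non-continuation byte is left in the stream
            // and handled as the start of the next sequence.
            int next = in.peek();
            if (next == std::char_traits<char>::eof() || (next & 0xC0) != 0x80)
                break;
            codepoint += static_cast<char>(in.get());
        }
        if (codepoint.size() == len)
            ++result.perCodePoint[codepoint];
        else
            ++result.malformed;
    }
    return result;
}

// Convenience overload for in-memory text.
Utf8Counts countUtf8(const std::string& text)
{
    std::istringstream in(text);
    return countUtf8(in);
}
```

To use it on the file from the question, open it in binary mode, for example std::ifstream f("text.txt", std::ios::binary), and pass the stream to countUtf8. The map keys can be written straight to std::cout on a terminal that expects UTF-8.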

Upvotes: 0

Michael Dorgan

Reputation: 12515

There are wide-char versions of everything, but if you want to do something very similar to what you have now and your file uses a 16-bit Unicode encoding:

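map<unsigned short,int> dict;
fstream f(filename);
char lo, hi;
// Beware endian issues here - should work either way for char counting though.
// Stop as soon as a full 2-byte unit can no longer be read.
while (f.get(lo) && f.get(hi))
{
    // Go through unsigned char so bytes >= 0x80 are not sign-extended.
    unsigned short val = static_cast<unsigned char>(lo)
                       | (static_cast<unsigned char>(hi) << 8);

    if(val == 0) break;

    cout<<val;
    dict[val]++;   // operator[] starts new entries at 0
}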
f.close();
cout<<endl;
for (auto it=dict.begin();it!=dict.end();it++)
{
    cout<<(*it).first<<":\t"<<(*it).second<<endl;
}

The above code makes lots of assumptions (all chars 16-bit, even number of bytes in file, etc.), but it should do what you want or at least give you a quick idea of how it could work with wide chars.
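The "all chars 16-bit" assumption breaks for code points above U+FFFF, which UTF-16 encodes as a surrogate pair of two 16-bit units. A minimal sketch of how the pairs could be combined before counting follows. The function name countUtf16CodePoints is hypothetical, it takes units already read into a std::u16string, and counting an unpaired surrogate as itself is an assumption of this sketch.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Hypothetical helper: count code points in a sequence of UTF-16 code units.
// Assumption: an unpaired surrogate is counted under its own value.
std::map<char32_t, int> countUtf16CodePoints(const std::u16string& units)
{
    std::map<char32_t, int> counts;
    for (std::size_t i = 0; i < units.size(); ++i) {
        char32_t cp = units[i];
        bool isHigh = cp >= 0xD800 && cp <= 0xDBFF;
        if (isHigh && i + 1 < units.size()) {
            char32_t low = units[i + 1];
            if (low >= 0xDC00 && low <= 0xDFFF) {
                // Combine the high and low surrogate into one code point.
                cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
                ++i;  // the low surrogate has been consumed
            }
        }
        ++counts[cp];
    }
    return counts;
}
```

With this, a character such as U+1F600 is counted once instead of as two unrelated 16-bit values. Reading the units from a file still needs the byte order handled as in the loop above.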

Upvotes: 0

Déjà vu

Reputation: 28850

You need a Unicode library to handle Unicode characters. Coding, say, UTF-8 handling yourself would be a harsh task, and it would be reinventing the wheel.

In this Q/A from SO there is a good one mentioned, and you'll find advice from other answers.

Upvotes: 1
