Reputation: 1673
I have written some simple code to count the occurrences of each distinct character in a text. This is the code:
#include <iostream>
#include <fstream>
#include <map>
using namespace std;
const char* filename="text.txt";
int main()
{
map<char,int> dict;
fstream f(filename);
char ch;
while (f.get(ch))
{
if(!f.eof())
cout<<ch;
if (!dict[ch])
dict[ch]=0;
dict[ch]++;
}
f.close();
cout<<endl;
for (auto it=dict.begin();it!=dict.end();it++)
{
cout<<(*it).first<<":\t"<<(*it).second<<endl;
}
system("pause");
}
The program counts ASCII characters correctly, but it does not work for Unicode characters such as Chinese characters. How can I make it work with Unicode characters?
Upvotes: 1
Views: 1536
Reputation: 6777
First off, what do you want to count? Unicode codepoints or grapheme clusters, i.e., characters in the encoding sense, or characters as perceived by the reader? Also keep in mind that "wide characters" (16 bit characters) are not Unicode characters (UTF-16 is variable length just like UTF-8!).
In any case, get a library such as ICU to do the actual codepoint/cluster iteration. For counting you need to replace the char type in your map with an appropriate type (either a 32-bit unsigned int for codepoints, or normalized strings for grapheme clusters; normalization should, again, be taken care of by a library).
Grapheme clusters: http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries
Normalization: http://unicode.org/reports/tr15/
Upvotes: 2
Reputation: 154037
If you can compromise and just count code points, it's fairly
simple to do directly in UTF-8. Your dictionary, however, will
have to be std::map<std::string, int>. Once you've got the
first byte of a UTF-8 sequence, a lookup table tells you how many
bytes the code point occupies:
while ( f.get( ch ) ) {
static size_t const charLen[] =
{
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 0, 0,
} ;
int chLen = charLen[ static_cast<unsigned char>( ch ) ];
if ( chLen <= 0 ) {
// error: impossible first character for UTF-8
continue;
}
std::string codepoint( 1, ch );
-- chLen;
while ( chLen != 0 ) {
if ( !f.get( ch ) ) {
// error: file ends in middle of a UTF-8 code point.
break;
} else if ( (ch & 0xC0) != 0x80 ) {
// error: illegal following character in UTF-8
break;
} else {
codepoint += ch;
-- chLen;   // one fewer continuation byte to read
}
}
// count only code points that were read completely
if ( chLen == 0 ) {
++ dict[codepoint];
}
}
You'll note that most of the code is involved in error handling.
Upvotes: 0
Reputation: 12515
There are wide-char versions of everything, though if you want to do something very similar to what you have now and are using a 16-bit encoding of Unicode, you could try:
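map<unsigned short,int> dict;
fstream f(filename, ios::in | ios::binary);
char lo, hi;
// Read two bytes per iteration; stop at end of file or on a trailing odd byte.
while (f.get(lo) && f.get(hi))
{
// Beware endian issues here - should work either way for char counting though.
// Cast through unsigned char so bytes >= 0x80 are not sign-extended.
unsigned short val = static_cast<unsigned short>(
static_cast<unsigned char>(lo) | (static_cast<unsigned char>(hi) << 8));
// operator[] value-initializes a missing entry to 0, so no explicit check is needed.
dict[val]++;
}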
f.close();
cout<<endl;
for (auto it=dict.begin();it!=dict.end();it++)
{
cout<<(*it).first<<":\t"<<(*it).second<<endl;
}
The above code makes lots of assumptions (all chars 16-bit, even number of bytes in file, etc.), but it should do what you want or at least give you a quick idea of how it could work with wide chars.
Upvotes: 0
Reputation: 28850
You need a Unicode library to handle Unicode characters. Decoding, say, UTF-8 yourself would be a harsh task, and reinventing the wheel.
In this Q/A from SO there is a good one mentioned, and you'll find advice from other answers.
Upvotes: 1