Reputation:
I'm writing a function using ICU to parse a Unicode string that consists of kanji numeric character(s), and I want to return the integer value of the string.
"五" => 5
"三十一" => 31
"五千九百七十二" => 5972
I'm setting the locale to Locale::getJapan() and using NumberFormat::parse() to parse the character string. However, whenever I pass it any kanji characters, parse() sets the status to U_INVALID_FORMAT_ERROR.
Does anyone know if ICU supports Kanji character strings in the NumberFormat::parse() method? I was hoping that since I'm setting the Locale to Japanese that it would be able to parse Kanji numeric values.
Thanks!
#include <iostream>
#include <unicode/numfmt.h>
using namespace std;
using namespace icu; // recent ICU releases no longer import the icu namespace by default
int main(int argc, char **argv) {
const Locale &jaLocale = Locale::getJapan();
UErrorCode status = U_ZERO_ERROR;
NumberFormat *nf = NumberFormat::createInstance(jaLocale, status);
UChar number[] = {0x4E94, 0}; // '五' (5); NUL-terminated, as UnicodeString(const UChar*) requires
UnicodeString numStr(number);
Formattable formattable;
nf->parse(numStr, formattable, status);
if (U_FAILURE(status)) {
cout << "error parsing as number: " << u_errorName(status) << endl;
return(1);
}
cout << "long value: " << formattable.getLong() << endl;
}
Upvotes: 8
Views: 2288
Reputation: 4350
You can use ICU's Rule-Based Number Format (RBNF). In C++ that is the RuleBasedNumberFormat class in rbnf.h; in C it is unum.h with the UNUM_SPELLOUT style. Use the "ja" locale for Japanese in both cases. Atryom provides a correction to your code for C++: create the formatter with new RuleBasedNumberFormat(URBNF_SPELLOUT, jaLocale, status); instead of NumberFormat::createInstance().
Upvotes: 6
Reputation: 5087
This is actually quite difficult, especially if you start looking at the obscure kanji for very large numbers.
In Perl, there is a very complete implementation in Lingua::JA::Numbers. Its source might be inspirational if you want to port it to C++.
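For the common numerals only, a hand-written parser stays short. The following is a minimal sketch, not a port of Lingua::JA::Numbers: kanjiToInteger and digitValue are hypothetical names, the input is assumed to be already decoded to UTF-32 code points, and only 〇 to 九, 十, 百, 千, 万, 億 and 兆 are recognised. It does not validate the order of the units and rejects positional forms such as 二〇.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: value of a digit kanji (〇 to 九), or -1 if c is not one.
static int digitValue(char32_t c) {
    static const std::u32string digits = U"〇一二三四五六七八九";
    const std::u32string::size_type pos = digits.find(c);
    return pos == std::u32string::npos ? -1 : static_cast<int>(pos);
}

// Hypothetical helper: converts a kanji numeral to an integer.
// Returns false for empty input, unsupported characters, or two digits in a row.
bool kanjiToInteger(const std::u32string& text, std::uint64_t& out) {
    if (text.empty()) return false;
    std::uint64_t total = 0;    // sum of the completed 万/億/兆 groups
    std::uint64_t section = 0;  // value of the current group (below 10000)
    std::uint64_t digit = 0;    // pending digit that has not met a unit yet
    bool haveDigit = false;

    for (char32_t c : text) {
        const int d = digitValue(c);
        if (d >= 0) {
            if (haveDigit) return false;  // positional notation is not handled
            digit = static_cast<std::uint64_t>(d);
            haveDigit = true;
            continue;
        }
        std::uint64_t unit = 0;
        switch (c) {
            case U'十': unit = 10ULL; break;
            case U'百': unit = 100ULL; break;
            case U'千': unit = 1000ULL; break;
            case U'万': unit = 10000ULL; break;
            case U'億': unit = 100000000ULL; break;
            case U'兆': unit = 1000000000000ULL; break;
            default: return false;  // not a supported numeral character
        }
        if (unit < 10000ULL) {
            // 十/百/千 scale the pending digit; a bare unit counts as 1 (十 = 10).
            section += (haveDigit ? digit : 1ULL) * unit;
        } else {
            // 万/億/兆 close the current group and scale all of it.
            section += haveDigit ? digit : 0ULL;
            total += section * unit;
            section = 0;
        }
        digit = 0;
        haveDigit = false;
    }
    out = total + section + (haveDigit ? digit : 0ULL);
    return true;
}
```

Usage is a single call such as kanjiToInteger(U"五千九百七十二", value). Stopping at 兆 keeps every well-formed numeral below 10^16, so the sum fits in a 64-bit integer; the larger units (京, 垓 and beyond) are exactly where the difficulty mentioned above starts.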
Upvotes: 0
Reputation: 2165
I created a small Perl module to do this a while back. It can convert Arabic <=> Japanese, and though I haven't tested it exhaustively, I think it's pretty comprehensive. Feel free to improve it.
package kanjiArabic;
use strict;
use warnings;
our $VERSION = "1.00";
use utf8;
our %big = (
十 => 10,百 => 100,千 => 1000,
);
our %bigger = (
万 => 10000,億 => 100000000,
兆 => 1000000000000,京 => 10000000000000000,
垓 => 100000000000000000000,
);
#precompile regexes
our $qr = qr/[0-9]/;
our $bigqr = qr/[十百千]/;
our $biggerqr = qr/[万億兆京垓]/;
#this routine does most of the real work.
sub kanji2arabic{
$_ = shift;
tr/〇一二三四五六七八九/0123456789/;
#optionally precompile for performance boost
s/(?<=${qr})(${bigqr})/\*${1}/g;
s/(?<=${bigqr})(${bigqr})/\+${1}/g;
s/(${bigqr})(?=${qr})/${1}\+/g;
s/(${bigqr})(?=${bigqr})/${1}\+/g;
s/(${bigqr})/${big{$1}}/g;
s/([0-9\+\*]+)/\(${1}\)/g;
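# NOTE: the original lines from here to the start of %asmall were lost.
# The rest of this routine is a reconstruction (an assumption), not the author's code.
s/(?<=\))(${biggerqr})/\*${1}/g;
s/(${biggerqr})(?=\()/${1}\+/g;
s/(${biggerqr})/${bigger{$1}}/g;
return $_;
}
our %asmall = (
0 => "〇", 1 => "一", 2 => "二", 3 => "三", 4 => "四",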
5 => "五", 6 => "六", 7 => "七", 8 => "八", 9 => "九",
);
our %places = (
1 => 10,
2 => 100,
3 => 1000,
4 => 10000,
8 => 100000000,
12 => 1000000000000,
16 => 10000000000000000,
20 => 100000000000000000000,
);
our %abig = (
10 => "十",
100 => "百",
1000 => "千",
10000 => "万",
100000000 => "億",
1000000000000 => "兆",
10000000000000000 => "京",
100000000000000000000 => "垓",
);
our $MAX = 24; #We only support numbers up to 24 digits!
sub arabic2kanji{
my @number = reverse(split(//,$_[0]));
my @kanji;
for(my $i=$#number;$i>=0;$i--){
if( $i==0 ){push(@kanji,$asmall{$number[$i]});}
elsif( $i % 4 == 0 ){
if( $number[$i] !~ m/[01]/ ){
push(@kanji,$asmall{$number[$i]});
}
push(@kanji,$abig{$places{$i}});
}else{
my $p = $i % 4;
if( $number[$i]==0 ){
next;
}elsif( $number[$i]==1 ){
push(@kanji,$abig{$places{$p}});
}else{
push(@kanji,$asmall{$number[$i]});
push(@kanji,$abig{$places{$p}});
}
}
}
return join("",@kanji);
}
sub eval_k2a{
#feed me utf-8!
if($_[0] !~ m/^[〇一二三四五六七八九十百千万億兆京垓]+$/){
print "Error: ".$_[0].
" not a Kanji number.\n" if defined($_[1])&&$_[1]==1;
return -1;
}
my $expression = kanji2arabic($_[0]);
print $expression."\n" if defined($_[1])&&$_[1]==1;
return eval($expression);
}
1;
You'd then call it from another script like so:
#!/usr/bin/perl -w
use strict;
use warnings;
use Encode;
use kanjiArabic;
my $kanji = kanjiArabic::arabic2kanji($ARGV[0]);
print "Kanji: ".encode("utf8",$kanji)."\n";
my $arabic = kanjiArabic::eval_k2a($kanji);
print "Back to arabic...\n";
print "Arabic: ".$arabic."\n";
And use this script like so:
kettle:~/k2a$ ./k2a.pl 5000215
Kanji: 五百万二百十五
Back to arabic...
Arabic: 5000215
rock on.
Upvotes: 3
Reputation: 14121
I was inspired by your question to solve this problem using Python.
If you don't find a C++ solution, it shouldn't be too hard to adapt this to C++.
Upvotes: 1