Ian Dunn
Ian Dunn

Reputation: 3680

Searching Geoname database with non-latin characters

I have a copy of the Geonames database stored in a MySQL database, and a PHP application that allows users to search the database for their city. It works fine if they type the city name in English, but I want them to be able to search in their native language.

For example, instead of asking a Japanese speaker to search for Tokyo, they should be able to search for 東京.

The Geonames database contains an alternatenames column with, "alternatenames, comma separated, ascii names automatically transliterated, convenience attribute from alternatename table, varchar(10000)."

For example, the alternatenames value for the Tokyo row is Edo,TYO,Tochiu,Tocio,Tokija,Tokijas,Tokio,Tokió,Tokjo,Tokyo,Toquio,Toquio - dong jing,Toquio - æ±äº¬,Tòquio,Tókýó,Tóquio,TÅkyÅ,dokyo,dong jing,dong jing dou,tokeiyw,tokkiyo,tokyo,twkyw,twqyw,Τόκιο,Токио,Токё,Токіо,ÕÕ¸Õ¯Õ«Õ¸,טוקיו,توكيو,توکیو,طوكيو,ܛܘܟÜܘ,ܜܘܟÜܘ,टोकà¥à¤¯à¥‹,டோகà¯à®•à®¿à®¯à¯‹,โตเà¸à¸µà¸¢à¸§,ტáƒáƒ™áƒ˜áƒ,东京,æ±äº¬,æ±äº¬éƒ½,ë„ì¿„.

Those values don't contain 東京 exactly, but I'm guessing that they contain a form of it that has been encoded or converted in some way. So, I assuming that if I perform the same encoding/conversion on my search string, then I'll be able to match the row. For example:

mysql_query( sprintf( "
    SELECT * FROM geoname 
    WHERE 
        MATCH( name, asciiname, alternatenames ) 
        AGAINST ( %s )  
    LIMIT 1",
    iconv( 'UTF-8', 'ASCII', '東京' )
) );

The problem is that I don't know what that conversion would be. I've tried lots of combinations of iconv(), mb_convert_string(), etc, but with no luck.

The MySQL table looks like this:

CREATE TABLE `geoname` (
 `geonameid` int(11) NOT NULL DEFAULT '0',
 `name` varchar(200) DEFAULT NULL,
 `asciiname` varchar(200) DEFAULT NULL,
 `alternatenames` mediumtext,
 `latitude` decimal(10,7) DEFAULT NULL,
 `longitude` decimal(10,7) DEFAULT NULL,
 `fclass` char(1) DEFAULT NULL,
 `fcode` varchar(10) DEFAULT NULL,
 `country` varchar(2) DEFAULT NULL,
 `cc2` varchar(60) DEFAULT NULL,
 `admin1` varchar(20) DEFAULT NULL,
 `admin2` varchar(80) DEFAULT NULL,
 `admin3` varchar(20) DEFAULT NULL,
 `admin4` varchar(20) DEFAULT NULL,
 `population` int(11) DEFAULT NULL,
 `elevation` int(11) DEFAULT NULL,
 `gtopo30` int(11) DEFAULT NULL,
 `timezone` varchar(40) DEFAULT NULL,
 `moddate` date DEFAULT NULL,
 PRIMARY KEY (`geonameid`),
 KEY `timezone` (`timezone`),
 FULLTEXT KEY `namesearch` (`name`,`asciiname`,`alternatenames`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8mb4

Can anyone point me in the right direction?

Upvotes: 2

Views: 251

Answers (2)

miken32
miken32

Reputation: 42716

When I download the Japan file and set up a database like this:

CREATE TABLE geonames (
    geonameid SERIAL,
    name varchar(200),
    asciiname varchar(200),
    alternatenames varchar(10000),
    latitude float,
    longitude float,
    featureclass varchar(1),
    featurecode varchar(10),
    countrycode varchar(2),
    cc2 varchar(200),
    admin1code varchar(20),
    admin2code varchar(80),
    admin3code varchar(20),
    admin4code varchar(20),
    population BIGINT,
    elevation INT,
    dem INT,
    timezone varchar(40),
    modificationdate DATE
    ) CHARSET utf8mb4;

Then I load the data like this:

LOAD DATA INFILE '/tmp/JP.txt' INTO TABLE geonames CHARACTER SET utf8mb4;

And select it like this:

SELECT alternatenames FROM geonames WHERE geonameid=1850147\G

I get this:

*************************** 1. row ***************************
alternatenames: Edo,TYO,Tochiu,Tocio,Tokija,Tokijas,Tokio,Tokió,Tokjo,Tokyo,Toquio,Toquio - dong jing,Toquio - 東京,Tòquio,Tókýó,Tóquio,Tōkyō,dokyo,dong jing,dong jing dou,tokeiyw,tokkiyo,tokyo,twkyw,twqyw,Τόκιο,Токио,Токё,Токіо,Տոկիո,טוקיו,توكيو,توکیو,طوكيو,ܛܘܟܝܘ,ܜܘܟܝܘ,टोक्यो,டோக்கியோ,โตเกียว,ტოკიო,东京,東京,東京都,도쿄

I can also do a search like this:

SELECT name FROM geonames WHERE alternatenames LIKE '%,東京,%';

Which is a long way of saying: Note the charset declaration when I created the table. I believe this is what you failed to do when you created your database.

Upvotes: 3

spencer7593
spencer7593

Reputation: 108400

Recommended reading:

https://www.joelonsoftware.com/articles/Unicode.html

http://kunststube.net/encoding/


In terms of MySQL, what is of critical importance is the characterset of the MySQL connection. That's the characterset that MySQL Server thinks the client is using in its communication.

SHOW VARIABLES LIKE '%characterset%'

If that isn't set right, for example, the client is sending latin1 (ISO-8859-1), but MySQL server thinks it's receiving UTF8, or vice versa, there's potential for mojibake.

Also of importance is the characterset of the alternatenames column.


One issue dealing with multibyte character set is going to be the PHP sprintf function. Many of the string handling functions in PHP have "mutlibyte" equivalents , that correctly handle strings containing multibyte characters.

https://secure.php.net/manual/en/book.mbstring.php

Unfortunately, there is no builtin mb_sprintf function.

For a more detailed description of string handling in PHP including multibyte characters/charactersets:

https://secure.php.net/manual/en/language.types.string.php#language.types.string.details

excerpt:

Ultimately, this means writing correct programs using Unicode depends on carefully avoiding functions that will not work and that most likely will corrupt the data and using instead the functions that do behave correctly, generally from the intl and mbstring extensions. However, using functions that can handle Unicode encodings is just the beginning. No matter the functions the language provides, it is essential to know the Unicode specification.

Also, a google search of "utf8 all the way through" may return some helpful notes. But be aware that this mantra is not a silver bullet or panacea to the issues.


Another possible issue, noted in the MySQL Reference Manual:

https://dev.mysql.com/doc/refman/5.7/en/fulltext-restrictions.html

13.9.5 Full-Text Restrictions

Ideographic languages such as Chinese and Japanese do not have word delimiters. Therefore, the built-in full-text parser cannot determine where words begin and end in these and other such languages.

In MySQL 5.7.6, a character-based ngram full-text parser that supports Chinese, Japanese, and Korean (CJK), and a word-based MeCab parser plugin that supports Japanese are provided for use with InnoDB and MySIAM tables.

Upvotes: 2

Related Questions