nomistic
nomistic

Reputation: 2962

php search with latin basic, but return results with diactrics

I'm running into a complicated situation here, and I'm hoping for a push in the right direction.

I need to allow Basic Latin searches to bring back results with diacritics. This is further complicated by the fact that the data is stored with HTML instead of pure ASCII. I have been making some progress, but have come across two problems.

First: I'm able to do a partial conversion of the data into something marginally useful, using something like this:

$string = 'Véra';
$converted = html_entity_decode($string, ENT_COMPAT, 'UTF-8');
setlocale(LC_ALL, 'en_US.UTF8');
$translit = iconv('UTF-8', 'ASCII//TRANSLIT', $converted);
echo $translit;

This brings back this result: V'era This is a start but what I really need is Vera. I can do a preg_replace on resulting string, but is there a way of just bringing it back without the hyphen? This is only one example; there are a lot more diacritics in the database (e.g. ñ and more). I feel like this has been addressed before (e.g. iconv returns strange results), but there don't appear to be any solutions listed.

Bigger Problem: I need to convert a string such as Vera and be able to bring back results with Véra. as well as results of Vera. However I believe I need to get problem 1 solved first before I can get to this point.

I'm thinking something like if ($translit) { return $string} but I'm a bit unsure of how to handle this.

All help appreciated.

Edit: I'm thinking this might be done easier directly in the database, however I'm running into issues with DQL. I know that there are ways with doing it in SQL with a stored procedure, but with limited access to the database, I'm open any suggestions for dealing with this in Doctrine

Okay, so maybe I'm making this too difficult

All I need is a way of finding entries that have been HTML encoded in the database without having to search with either the specific encoding but also without the diacritic itself. If I search for Jose, it should bring up anything in the database labeled as José

Upvotes: 2

Views: 227

Answers (2)

Levi
Levi

Reputation: 820

Preface: It's not quite clear whether the data to search is already in the database or whether you're just taking advantage of the fact that the database has logic for character comparisons. I'm going to assume that the data source is the DB.

The fact that you're trying to search html raises the question of whether you really want to search HTML or in fact want to search the user-visible text in HTML and strip html tags (What if there is a diacritic in a tag attribute? What if a word is broken with an empty <span>? Should it match? What if it was broken with a <br>?)


MySQL has the notion of both character sets (how characters are encoded) and collations (how characters are compared)

Relevant Documentation:

Assuming your mysql client/terminal is correctly set for UTF8 encoding, then the following demonstrates the effect of overriding the collation (using ß as particularly interesting example)

> SET NAMES 'utf8';
> SELECT
   'ß',
   'ss',
   'ß' = 'ss' COLLATE utf8_unicode_ci AS ss_unicode,
   'ß' = 'ss' COLLATE utf8_general_ci AS ss_general,
   'ß' = 's' COLLATE utf8_general_ci AS s_general;
+----+----+------------+------------+-----------+
| ß  | ss | ss_unicode | ss_general | s_general |
+----+----+------------+------------+-----------+
| ß  | ss |          1 |          0 |         1 |
+----+----+------------+------------+-----------+
1 row in set (0.00 sec)

Note: general is the faster but not-strictly-correct version of the unicode collation -- but even that is wrong if you speak turkish (see: dotted uppercase i)

I would save decoded html in the database and search on this making sure that the collation is set correctly.

  • Confirm that the table/column collation is correct using SHOW CREATE TABLE xxx. Change it manually (ALTER TABLE ...), or use doctrine annotations as per this answer & use doctrine migrations to update (and confirm afterwards with SHOW CREATE TABLE that your version of doctrine respects collation)
  • Confirm that doctrine is configured to use utf8 encoding.

If you just need to override the collation for one particular query (eg you don't have permission to change the DB structure or it will break other code):

  • If you need to map to a doctrine ORM object, use NativeQuery and add COLLATE overrides as per the example above.

  • If you just want the record ID & field then you can use a direct query bypassing the ORM with a COLLATE override

Upvotes: 1

ahmetson
ahmetson

Reputation: 306

You can use REGEX_REPLACE function to strip diactrics in Database, while requesting. Mysql database has no built-in regex_replace function, but you can use User Defined Library, or change library to MariaDB. MariaDB based on Mysql (Migrating data to MariaDB will be easy).

Then in MariaDB you can use queries like:

SELECT * FROM `test` WHERE 'jose' = REGEXP_REPLACE(name, '(&[A-Za-z]*;)', '')
// another variant with PHP variable
SELECT `table`.name FROM `table` WHERE $search = REGEXP_REPLACE(name, '(&[A-Za-z]*;)', '')  

Even phpMyAdmin supports MariaDB. I tested my query on Demo page. It worked pretty well: enter image description here


Or if you want to stay on MySql, add this UDFs:

https://github.com/mysqludf/lib_mysqludf_preg

Upvotes: 0

Related Questions