Vahid Amiri

Reputation: 11117

PHP MySQL search character coding issues

I'm using PDO to connect to a MySQL database. My connection string already includes charset=utf8mb4, and all of my databases and tables use utf8mb4_unicode_ci, but I'm facing a problem.

To search for entries in the content table by title, I'm using the query below:

SELECT * FROM content WHERE title LIKE '%سيگنالها%'

The keyword is a Persian word. The query above returns 1 result, which is correct and as expected.

But if I build a form in my PHP app and enter the SAME word, whether from a macOS/Windows PC or from an Android phone, I get 0 results.

I tracked this issue down, and it seems that even though the words entered by the user look exactly the same as the one already in the database, they are in fact NOT the same.

According to this online tool, the decimal character codes are as follows:

for سيگنالها it's: 1587, 1610, 1711, 1606, 1575, 1604, 1607, 1575

While

for سیگنالها it's: 1587, 1740, 1711, 1606, 1575, 1604, 1607, 1575

Did you spot the difference? It's the second code point: 1610 versus 1740. In fact, if you copy both values and paste them into a tool like that, you will see the difference for yourself.

What can I do to solve this annoying problem? I'm using PHP 7 and MariaDB 10.1.

Upvotes: 1

Views: 234

Answers (2)

Andrei

Reputation: 1863

They are not the same character, even though they look the same when joined inside a word and may even have the same meaning.

In the first string, the second character (1610, U+064A) is ARABIC LETTER YEH[1], while in the other string it (1740, U+06CC) is ARABIC LETTER FARSI YEH[2].

[1] https://en.wiktionary.org/wiki/%D9%8A [2] https://en.wiktionary.org/wiki/%DB%8C

I also created a simple PHP form and tested both strings to see whether the value sent through $_POST is preserved. Result: the value arrives unchanged; PHP does not convert it.

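So what's probably going on is that some of the text was typed with an Arabic keyboard layout, which produces ARABIC LETTER YEH (1610), rather than with a Persian layout, which produces ARABIC LETTER FARSI YEH (1740). The recommended solution is some kind of normalization, applied to both the stored titles and the search input, so that one form of the letter is used consistently.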
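A minimal sketch of what such a normalization could look like in PHP. The function name normalizePersian is hypothetical, the KAF mapping is my own addition (it is another letter that commonly differs between the two layouts), and the choice to normalize toward the Persian forms is arbitrary; what matters is that the same mapping is applied on both sides.

```php
<?php
// Hypothetical helper (not from the question): replace the Arabic-layout
// variants of two letters with the forms a Persian layout produces.
function normalizePersian(string $text): string
{
    // strtr() with an array replaces whole byte sequences, so it is safe
    // for multi-byte UTF-8 characters.
    return strtr($text, [
        "\u{064A}" => "\u{06CC}", // ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
        "\u{0643}" => "\u{06A9}", // ARABIC LETTER KAF -> ARABIC LETTER KEHEH (assumed extra mapping)
    ]);
}

// The two spellings from the question, written as escapes to avoid ambiguity.
// Second code point is U+064A (1610) in one and U+06CC (1740) in the other.
$fromDatabase = "\u{0633}\u{064A}\u{06AF}\u{0646}\u{0627}\u{0644}\u{0647}\u{0627}";
$fromForm     = "\u{0633}\u{06CC}\u{06AF}\u{0646}\u{0627}\u{0644}\u{0647}\u{0627}";

var_dump($fromDatabase === $fromForm); // bool(false)
var_dump(normalizePersian($fromDatabase) === normalizePersian($fromForm)); // bool(true)
```

Normalizing only the search keyword is not enough here: the title already stored in the table would also have to be normalized (on write, or with a one-time update), otherwise the LIKE comparison still sees two different characters.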

See these discussions:

1) https://groups.google.com/forum/embed/?place=forum/persian-computing#!topic/persian-computing/xS-G0qIGS8A

2) https://github.com/Samsung/KnowledgeSharingPlatform/blob/master/sameas/lib/lucene-analyzers-common-5.0.0/org/apache/lucene/analysis/fa/PersianNormalizer.java

3) can't search in farsi text with arabic keyboard on iphone

Upvotes: 1

Akam

Reputation: 1052

The "ي" in your first word, "سيگنالها", is a different character from the "ی" in your second word, "سیگنالها".

First ي: is ARABIC LETTER YEH (U+064A)

Second ی: is ARABIC LETTER FARSI YEH (U+06CC)

Their Unicode code points differ, so the two strings do not match. Please see https://www.key-shortcut.com/en/writing-systems/%EF%BA%95%EF%BA%8F%D8%A2-arabic-alphabet/ for more information.
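If you want to check this from PHP instead of an online tool, something like the following sketch should work. The helper name codePoints is hypothetical, and it assumes the mbstring extension is available; it avoids mb_ord() so it also runs on PHP versions before 7.2.

```php
<?php
// Hypothetical helper: list the decimal Unicode code points of a UTF-8 string.
function codePoints(string $text): array
{
    // UTF-32BE stores every code point as one 4-byte big-endian integer,
    // which unpack('N*') can read directly.
    $utf32 = mb_convert_encoding($text, 'UTF-32BE', 'UTF-8');
    return array_values(unpack('N*', $utf32));
}

// The two words from the question, written as escapes.
$first  = "\u{0633}\u{064A}\u{06AF}\u{0646}\u{0627}\u{0644}\u{0647}\u{0627}";
$second = "\u{0633}\u{06CC}\u{06AF}\u{0646}\u{0627}\u{0644}\u{0647}\u{0627}";

echo implode(', ', codePoints($first)), PHP_EOL;  // 1587, 1610, 1711, 1606, 1575, 1604, 1607, 1575
echo implode(', ', codePoints($second)), PHP_EOL; // 1587, 1740, 1711, 1606, 1575, 1604, 1607, 1575

// Report every position where the two words differ.
foreach (array_diff_assoc(codePoints($first), codePoints($second)) as $index => $code) {
    printf("index %d: %d vs %d\n", $index, $code, codePoints($second)[$index]);
}
```

Running the same check on a value read from the table and on a value received through $_POST shows which side contains which form of the letter.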

Upvotes: 1
