Telion

Reputation: 777

How to manipulate multilingual data in PHP?

I have a project that will receive data in any possible language. Right now I'm trying to parse a wiki page, extract the list of languages, and put it into a DB. Already at the parsing step I found that most of the native names are displayed as empty squares and other strange symbols. The declared charset is UTF-8.

I am not sure how this works and have no idea where to dig further. I couldn't find any information about handling multilingual content on websites. Do I need some kind of code table of all the symbols in order to use them? How do I make this work?

I need to:

Right now I have encoding problems, so some text is displayed incorrectly, as in the image below. Here is what I have so far (only one row of the wiki table is included):

header('Content-Type: text/html; charset=utf-8');

$html = '<table class="wikitable sortable jquery-tablesorter" id="Table">
<tbody>
<tr>
<td style="background-color:#ACE1AF;width:#ACE1AF;"></td>
<td><a href="/wiki/Northwest_Caucasian_languages" title="Northwest Caucasian languages">Northwest Caucasian</a></td>
<td><a href="/wiki/Abkhazian_language" class="mw-redirect" title="Abkhazian language">Abkhazian</a></td>
<td lang="ab" xml:lang="ab">аҧсуа бызшәа, аҧсшәа</td>
<td><span class="plainlinks"><a rel="nofollow" class="external text" href="http://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=ab">ab</a></span></td>
<td>abk</td>
<td>abk</td>
<td>abk</td>
<td>also known as Abkhaz</td>
</tr>
</tbody><tfoot></tfoot></table>';

$dom = new domDocument;
$dom->loadHTML($html);
$dom->preserveWhiteSpace = false;
$tables = $dom->getElementsByTagName('table');
$rows = $tables->item(0)->getElementsByTagName('tr');
foreach ($rows as $row)
{
    $cols = $row->getElementsByTagName('td');
    echo $cols->item(2)->nodeValue.' ';
    echo $cols->item(3)->nodeValue.' ';
    echo $cols->item(4)->nodeValue.'<br>';
    echo '<hr>';
}

The output looks like this: (screenshot of the garbled output)

However, if I output $html directly, everything is displayed correctly. I use the latest version of Google Chrome. I need some clues and tips about how this works and how I can make my code work properly.

Thanks for your attention.

Upvotes: 1

Views: 109

Answers (2)

Artur Babyuk

Reputation: 298

I think DOMDocument::loadHTML() assumes ISO-8859-1 (Latin-1) when the HTML does not declare its encoding, so UTF-8 characters outside Latin-1 get mangled.

Change the line $dom->loadHTML($html); to:

$dom->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'));

This should help.

More info in the related answer
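
Note that passing 'HTML-ENTITIES' to mb_convert_encoding() is deprecated as of PHP 8.2. A commonly used alternative (not part of the original answer) is to prepend an XML encoding declaration so the parser reads the markup as UTF-8. Below is a minimal sketch of that workaround; the one-row sample is shortened from the table in the question, and the $cells variable is only illustrative.

```php
<?php
// Shortened, hypothetical sample: one row with a non-Latin native name.
$html = '<table><tbody><tr>
<td>Abkhazian</td>
<td lang="ab">аҧсуа бызшәа, аҧсшәа</td>
<td>ab</td>
</tr></tbody></table>';

$dom = new DOMDocument();
// The XML declaration acts purely as an encoding hint for the HTML parser;
// without a declared charset, loadHTML() falls back to ISO-8859-1.
$dom->loadHTML('<?xml encoding="UTF-8">' . $html);

// Collect the text of every cell; textContent is returned as UTF-8.
$cells = [];
foreach ($dom->getElementsByTagName('td') as $td) {
    $cells[] = trim($td->textContent);
}

echo implode(' | ', $cells), PHP_EOL;
```

When the page is fetched as a whole document that already contains a meta charset declaration, the hint should not be needed; it mainly matters for fragments like the one in the question.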

Upvotes: 1

Axon

Reputation: 469

Change the collation of the database, tables, and columns to utf8mb4_unicode_520_ci. Also keep in mind that with utf8mb4 the maximum length of a uniquely indexed VARCHAR column is 191 characters when InnoDB's 767-byte index key limit applies.

As far as I know, phpMyAdmin shows latin1_swedish_ci as the default collation (it is the default of older MySQL servers),

but this collation isn't recommended for multilingual websites.

UTF-8 was designed for exactly this purpose.

The ci suffix at the end of the collation name means case-insensitive.
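
The 191-character figure follows from simple arithmetic. As a rough sketch, assuming the 767-byte InnoDB index key limit that applies to the older COMPACT and REDUNDANT row formats (the variable names are only illustrative):

```php
<?php
// Assumption: 767-byte index key limit (InnoDB COMPACT/REDUNDANT row formats).
$indexLimitBytes = 767;

// utf8mb4 reserves up to 4 bytes per character in an index.
$maxCharsUtf8mb4 = intdiv($indexLimitBytes, 4);

// The legacy 3-byte utf8 (utf8mb3) charset reserves up to 3 bytes per character.
$maxCharsUtf8mb3 = intdiv($indexLimitBytes, 3);

echo "utf8mb4: {$maxCharsUtf8mb4} characters", PHP_EOL;
echo "utf8mb3: {$maxCharsUtf8mb3} characters", PHP_EOL;
```

This is why a VARCHAR(255) column with a unique index may need to be shortened to VARCHAR(191) after switching to utf8mb4 on such setups.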

Upvotes: 1
