Reputation: 777
I have a project that will receive data in any possible language. Right now I'm trying to parse a wiki page, extract the list of languages, and store it in a database. Already at the parsing step I found that most of the native names are displayed as empty squares and other strange symbols. The declared charset is UTF-8.
I am not sure how this works and have no idea where to dig further. I couldn't find any information about multilingual content on websites. Do I need some kind of code table for all the symbols in order to use them? How do I make this work?
I need to:
Right now I have some encoding problems, so some text is displayed incorrectly, as in the image below. Here is what I have so far (only one row of the table from the wiki):
header('Content-Type: text/html; charset=utf-8');
$html = '<table class="wikitable sortable jquery-tablesorter" id="Table">
<tbody>
<tr>
<td style="background-color:#ACE1AF;width:#ACE1AF;"></td>
<td><a href="/wiki/Northwest_Caucasian_languages" title="Northwest Caucasian languages">Northwest Caucasian</a></td>
<td><a href="/wiki/Abkhazian_language" class="mw-redirect" title="Abkhazian language">Abkhazian</a></td>
<td lang="ab" xml:lang="ab">аҧсуа бызшәа, аҧсшәа</td>
<td><span class="plainlinks"><a rel="nofollow" class="external text" href="http://www.loc.gov/standards/iso639-2/php/langcodes_name.php?iso_639_1=ab">ab</a></span></td>
<td>abk</td>
<td>abk</td>
<td>abk</td>
<td>also known as Abkhaz</td>
</tr>
</tbody><tfoot></tfoot></table>';
$dom = new domDocument;
$dom->loadHTML($html);
$dom->preserveWhiteSpace = false;
$tables = $dom->getElementsByTagName('table');
$rows = $tables->item(0)->getElementsByTagName('tr');
foreach ($rows as $row)
{
$cols = $row->getElementsByTagName('td');
echo $cols->item(2)->nodeValue.' ';
echo $cols->item(3)->nodeValue.' ';
echo $cols->item(4)->nodeValue.'<br>';
echo '<hr>';
}
But if I output $html directly, everything is displayed correctly. I use the latest version of Google Chrome. I need some clues and tips about how this works and how I can make this work properly.
Thanks for your attention.
Upvotes: 1
Views: 109
Reputation: 298
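I think the problem is that DOMDocument does not handle characters outside Latin-1 correctly here: by default, loadHTML() treats the input as ISO-8859-1 unless the markup itself declares an encoding, so the UTF-8 bytes are misread.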
Change line $dom->loadHTML($html);
to
$dom->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'));
This should help.
More info in the related answer
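As an alternative to converting the string, the encoding can be declared inside the markup that is handed to loadHTML(). The following is a minimal sketch, assuming the input is valid UTF-8; the helper name extractCells() and the shortened sample row are hypothetical and not part of the original code. Note also that, as far as I know, the 'HTML-ENTITIES' target of mb_convert_encoding() is deprecated as of PHP 8.2, which is one reason to consider this variant.

```php
<?php
// Hypothetical helper (not from the original post): returns the text of every <td>.
function extractCells(string $utf8Html): array
{
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    // The meta tag declares the encoding, so libxml reads the bytes as UTF-8
    // instead of falling back to its single-byte default.
    $dom->loadHTML(
        '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $utf8Html
    );

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $cells = [];
    foreach ($dom->getElementsByTagName('td') as $td) {
        $cells[] = trim($td->nodeValue);
    }
    return $cells;
}

// Shortened, hypothetical version of the row from the question.
$html = '<table><tr><td>Abkhazian</td><td lang="ab">аҧсуа бызшәа, аҧсшәа</td><td>ab</td></tr></table>';
print_r(extractCells($html));
```

The values returned this way are real UTF-8 strings, so they can be echoed on a page served with charset=utf-8 or passed on to the database unchanged.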
Upvotes: 1
Reputation: 469
Change the collation of the database, tables, and columns to utf8mb4_unicode_520_ci. Also keep in mind that with utf8mb4 the maximum length of a UNIQUE (indexed) VARCHAR column is 191 characters when the index key limit is 767 bytes, as with older InnoDB row formats.
As far as I know, phpMyAdmin uses latin1_swedish_ci by default (it is the server default in older MySQL versions), but this collation isn't recommended for multilingual websites; UTF-8 is made for this purpose. The ci at the end of the collation name means case-insensitive.
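The collation advice above could look roughly like this from the PHP side. This is only a sketch under assumptions: the table name languages, its columns, the helper names buildDsn() and saveLanguage(), and the connection values are all hypothetical, and a MySQL or MariaDB server with the PDO MySQL driver is assumed.

```php
<?php
// Hypothetical helper: builds a PDO DSN that also sets the connection charset.
function buildDsn(string $host, string $dbName): string
{
    // charset=utf8mb4 makes the client connection itself use 4-byte UTF-8,
    // otherwise correctly stored text can still be mangled in transit.
    return "mysql:host={$host};dbname={$dbName};charset=utf8mb4";
}

// Hypothetical table layout for the scraped language list.
// VARCHAR(191) keeps indexed columns within a 767-byte key limit under utf8mb4.
const CREATE_LANGUAGES_TABLE = <<<'SQL'
CREATE TABLE IF NOT EXISTS languages (
    id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    iso_639_1 CHAR(2) NOT NULL,
    english_name VARCHAR(191) NOT NULL,
    native_name VARCHAR(191) NOT NULL,
    UNIQUE KEY uniq_iso (iso_639_1)
) DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_520_ci
SQL;

// Hypothetical helper: stores one row using a prepared statement.
function saveLanguage(PDO $pdo, string $code, string $english, string $native): void
{
    $stmt = $pdo->prepare(
        'INSERT INTO languages (iso_639_1, english_name, native_name) VALUES (?, ?, ?)'
    );
    $stmt->execute([$code, $english, $native]);
}

// Usage (placeholder credentials, needs a running server):
// $pdo = new PDO(buildDsn('localhost', 'wiki'), 'user', 'password',
//                [PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION]);
// $pdo->exec(CREATE_LANGUAGES_TABLE);
// saveLanguage($pdo, 'ab', 'Abkhazian', 'аҧсуа бызшәа, аҧсшәа');
```

Setting the charset in the DSN matters as much as the table collation: the column can be utf8mb4 and the text can still arrive garbled if the connection uses a different character set.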
Upvotes: 1