Reputation: 1460
I am a bit of a newbie with encoding. I have a php file that allows users to upload csv files.
My issue is that when the file is created using excel for mac if the file contains utf-8 characters such as accented letters, my code will not work properly. Basically it will ignore the accented characters.
The problem occurs only when the file is saved using the Comma separated values
option.
In all the other cases such as file made in windows or using open office or even excel on mac but saving them as 'windows' file do not cause any problem.
mb_detect_encoding
returns false for the file causing troubles.
here is the code:
// say there is the word Nestlé in the file
$content = file_get_contents(addslashes($file_name));
var_dump(mb_detect_encoding($content)); // print false
$data = mb_convert_encoding($content, 'UTF-8', mb_detect_encoding($content, 'UTF-8, ISO-8859-1', true));
//$data = utf8_encode($content); //doesn't work
var_dump($data); // print Nestl
ini_set('auto_detect_line_endings',TRUE);
// more code here we don't need at the moment
This question gave me some indications: file_get_contents() Breaks Up UTF-8 Characters
Any help or idea on how to solve this problem? Thank you in advance
here is the new bit of code after the response posted by Anthony
$content = file_get_contents(addslashes($file_name));
// i have no control on how the file is generated so i need to to the replace in the code
$content = str_replace(",", "\t", $content);
var_dump($content);
$data = mb_convert_encoding($content, 'UTF-8', mb_detect_encoding($content, 'UTF-8, ISO-8859-1', true));
$data = mb_convert_encoding($data, 'UTF-16LE', 'UTF-8');
$data = chr(255) . chr(254) . $data;
var_dump($data); // this still print funny characters not the accented letter
Am i doing something wrong?
Upvotes: 2
Views: 3210
Reputation: 1460
All right, thank you Anthony, here is the line that will fix it:
$data = iconv('macintosh', 'UTF-8', $content);
so my final code will look something like this:
enter code here
$content = file_get_contents(addslashes($file_name));
var_dump(mb_detect_encoding($content));
// need to do this for an issue specific to Excel and more common on Excel for Mac
// using excel on mac if the file is saved as csv using the Comma separated values option we need to use iconv and not mb_convert_encoding
// we use mb_detect_encoding because the content of such file returns a false value
if(!mb_detect_encoding($content, 'UTF-8, ISO-8859-1', true)){
//$data = mb_convert_encoding($content, 'UTF-8', mb_detect_encoding($content, 'UTF-8, ISO-8859-1', 'macintosh', true));
$data = iconv('macintosh', 'UTF-8', $content);
}
// deal with known encoding types
else{
$data = mb_convert_encoding($content, 'UTF-8', mb_detect_encoding($content, 'UTF-8, ISO-8859-1', true));
}
Upvotes: 0
Reputation: 37065
This is an issue specific to Excel and more common on Excel for Mac, where UTF-8 multibyte characters are not properly displayed. You can confirm by using a different spreadsheet viewer, like Google Sheets.
The workaround for this is:
Use tabs (\t
) instead of commas as delimiter (don't worry, it's still technically a CSV).
After encoding to utf-8, convert the entire csv string to UTF-16LE :
mb_convert_encoding($csv_content, 'UTF-16LE', 'UTF-8');
Prefix the csv string with a little-endian byte-order-mark (LE BOM) :
$csv_content = chr(255) . chr(254) . $csv_content;
And this should do it.
Upvotes: 2