mikey
mikey

Reputation: 1460

issue converting to utf-8 csv file made with excel on mac

I am a bit of a newbie with encoding. I have a php file that allows users to upload csv files.

My issue is that when the file is created using excel for mac if the file contains utf-8 characters such as accented letters, my code will not work properly. Basically it will ignore the accented characters.

The problem occurs only when the file is saved using the Comma separated values option.

In all the other cases such as file made in windows or using open office or even excel on mac but saving them as 'windows' file do not cause any problem.

mb_detect_encoding returns false for the file causing troubles.

here is the code:

// say there is the word Nestlé in the file
$content = file_get_contents(addslashes($file_name));

var_dump(mb_detect_encoding($content)); // print false


$data  = mb_convert_encoding($content, 'UTF-8', mb_detect_encoding($content, 'UTF-8, ISO-8859-1', true));
            //$data  = utf8_encode($content);  //doesn't work

var_dump($data); // print Nestl

ini_set('auto_detect_line_endings',TRUE);

// more code here we don't need at the moment

This question gave me some indications: file_get_contents() Breaks Up UTF-8 Characters

Any help or idea on how to solve this problem? Thank you in advance

here is the new bit of code after the response posted by Anthony

$content = file_get_contents(addslashes($file_name));
// i have no control on how the file is generated so i need to to the replace in the code
$content = str_replace(",", "\t",  $content);
var_dump($content);
$data  = mb_convert_encoding($content, 'UTF-8', mb_detect_encoding($content, 'UTF-8, ISO-8859-1', true));


$data =  mb_convert_encoding($data, 'UTF-16LE', 'UTF-8');
$data = chr(255) . chr(254) . $data;

var_dump($data); // this still print funny characters not the accented letter

Am i doing something wrong?

Upvotes: 2

Views: 3210

Answers (2)

mikey
mikey

Reputation: 1460

All right, thank you Anthony, here is the line that will fix it:

$data = iconv('macintosh', 'UTF-8', $content);

so my final code will look something like this:

enter code here

$content = file_get_contents(addslashes($file_name));

var_dump(mb_detect_encoding($content));
// need to do this for an issue specific to Excel and more common on Excel for Mac
// using excel on mac if the file is saved as csv using the Comma separated values option we need to use iconv and not mb_convert_encoding
// we use mb_detect_encoding because the content of such file returns a false value
if(!mb_detect_encoding($content, 'UTF-8, ISO-8859-1', true)){
     //$data  = mb_convert_encoding($content, 'UTF-8', mb_detect_encoding($content, 'UTF-8, ISO-8859-1', 'macintosh', true));

      $data = iconv('macintosh', 'UTF-8', $content);


 } 
    // deal with known encoding types
 else{
         $data  = mb_convert_encoding($content, 'UTF-8', mb_detect_encoding($content, 'UTF-8, ISO-8859-1', true));
 }

Upvotes: 0

Anthony
Anthony

Reputation: 37065

This is an issue specific to Excel and more common on Excel for Mac, where UTF-8 multibyte characters are not properly displayed. You can confirm by using a different spreadsheet viewer, like Google Sheets.

The workaround for this is:

  1. Use tabs (\t) instead of commas as delimiter (don't worry, it's still technically a CSV).

  2. After encoding to utf-8, convert the entire csv string to UTF-16LE :

    mb_convert_encoding($csv_content, 'UTF-16LE', 'UTF-8');

  3. Prefix the csv string with a little-endian byte-order-mark (LE BOM) :

    $csv_content = chr(255) . chr(254) . $csv_content;

And this should do it.

Upvotes: 2

Related Questions