Benjen

Reputation: 2925

How to read multibyte characters from a CSV file using PHP

I have a CSV file that contains a mixture of English and Chinese characters (it is a list of contacts exported from the Mozilla Thunderbird email program). I am trying to write a function that extracts the information from this file. It appears that fgetcsv() does not support multibyte characters, and since I am running PHP 5.2, I do not have access to str_getcsv().

Although the situation above refers to English and Chinese, I am looking for a solution which will work with any language.

Right now I have the function namecards_import_str_getcsv() as my CSV parsing function, which tries to mimic str_getcsv().

function namecards_import_str_getcsv($input, $delimiter = ',', $enclosure = '"', $escape = '\\', $eol = '\n') {
  if (!function_exists('str_getcsv')) {
    if (is_string($input) && !empty($input)) {
      $output = array();
      $tmp    = preg_split("/".$eol."/",$input);
      if (is_array($tmp) && !empty($tmp)) {
        while (list($line_num, $line) = each($tmp)) {
          if (preg_match("/" . $escape . $enclosure . "/", $line)) {
            while ($strlen = strlen($line)) {
              $pos_delimiter = strpos($line, $delimiter);
              $pos_enclosure_start = strpos($line, $enclosure);
              if (is_int($pos_delimiter) && is_int($pos_enclosure_start) && ($pos_enclosure_start < $pos_delimiter)) {
                $enclosed_str = substr($line, 1);
                $pos_enclosure_end = strpos($enclosed_str, $enclosure);
                $enclosed_str = substr($enclosed_str, 0, $pos_enclosure_end);
                $output[$line_num][] = $enclosed_str;
                $offset = $pos_enclosure_end + 3;
              } 
              else {
                if (empty($pos_delimiter) && empty($pos_enclosure_start)) {
                  $output[$line_num][] = substr($line, 0);
                  $offset = strlen($line);
                } 
                else {
                  $output[$line_num][] = substr($line,0,$pos_delimiter);
                  $offset = (!empty($pos_enclosure_start) && ($pos_enclosure_start < $pos_delimiter))? $pos_enclosure_start : $pos_delimiter + 1;
                }
              }
              $line = substr($line,$offset);
            }
          } 
          else {
            $line = preg_split("/" . $delimiter . "/", $line);

            /*
             * Validating against pesky extra line breaks creating false rows.
            */
            if (is_array($line) && !empty($line[0])) {
              $output[$line_num] = $line;
            }
          }
        }
        return $output;
      } 
      else {
        return false;
      }
    } 
    else {
      return false;
    }
  }
  else {
    return str_getcsv($input);
  }
}

This function is called by the following code:

  $file = $_SESSION['namecards_csv_file'];

  if (file_exists($file->uri)) {
    // Load raw csv content into a handler variable.
    $handle = fopen($file->uri, "r");
    $cardinfo = array();
    while (($data = fgets($handle)) !== FALSE) {
      $data = namecards_import_str_getcsv($data);
      dsm($data);
      $cardinfo[] = $data[0];
    }
    fclose($handle);
  }
  else {
    drupal_set_message(t('CSV file doesn\'t exist'), 'error');
  }

In the resulting array, the strings of Chinese characters are in the correct positions, but they appear as symbols, e.g. "��".

Another method I tried before this was to simply use fgetcsv() (see the example below), but in that case the elements of the returned array were empty.

$file = $_SESSION['namecards_csv_file'];

if (file_exists($file->uri)) {
  // Load raw csv content into a handler variable.
  $handle = fopen($file->uri, "r");
  $cardinfo = array();
  while (($data = fgetcsv($handle, 5000, ",")) !== FALSE) {
    dsm($data);
    $cardinfo[] = $data;
  }
  fclose($handle);
}
else {
  drupal_set_message(t('CSV file doesn\'t exist'), 'error');
}

In case you are interested, here are the contents of the CSV file:

First Name,Last Name,Display Name,Nickname,Primary Email,Secondary Email,Screen Name,Work Phone,Home Phone,Fax Number,Pager Number,Mobile Number,Home Address,Home Address 2,Home City,Home State,Home ZipCode,Home Country,Work Address,Work Address 2,Work City,Work State,Work ZipCode,Work Country,Job Title,Department,Organization,Web Page 1,Web Page 2,Birth Year,Birth Month,Birth Day,Custom 1,Custom 2,Custom 3,Custom 4,Notes,
Ben,Gunn,Ben Gunn,Benny,[email protected],[email protected],,+94 (10) 11111111,+94 (10) 22222222,+94 (10) 33333333,,+94 44444444444,12 Benny Lane,,Beijing,Beijing,100028,China,13 asdfsdfs,,sdfsf,sdfsdf,134323,China,Manager,Sales,Benny Inc,,,,,,,,,,,
乔,康,乔 康,小康,,,,,,,,,,,,,,,北京市朝阳区,,,,,,,,,,,,,,,,,,,

Upvotes: 3

Views: 1845

Answers (1)

deceze

Reputation: 522402

Just writing up as an answer what was figured out in the comments:

fgetcsv() is locale-sensitive: it consults the LC_CTYPE setting when parsing, so call setlocale() with a UTF-8 locale before reading the file.
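
As a rough illustration of that advice, here is a minimal sketch. It assumes the file is UTF-8 encoded and that at least one UTF-8 locale is installed on the server. The helper name read_utf8_csv() and the candidate locale names are hypothetical, not something from the question, so adjust them to whatever `locale -a` reports on your system.

```php
<?php
// Hypothetical helper (not from the original post): reads a UTF-8 CSV file
// into an array of rows with fgetcsv(), after switching LC_CTYPE to a
// UTF-8 locale. The locale names below are assumptions.
function read_utf8_csv($path, array $locales = array('en_US.UTF-8', 'C.UTF-8')) {
    // Passing '0' only returns the current setting, so it can be restored later.
    $previous = setlocale(LC_CTYPE, '0');

    // setlocale() tries each candidate in turn and returns false if none is installed.
    if (setlocale(LC_CTYPE, $locales) === false) {
        trigger_error('No UTF-8 locale available; multibyte fields may be mangled', E_USER_WARNING);
    }

    $handle = fopen($path, 'r');
    if ($handle === false) {
        setlocale(LC_CTYPE, $previous);
        return false;
    }

    $rows = array();
    // A length of 0 means "no line length limit".
    while (($data = fgetcsv($handle, 0, ',', '"')) !== false) {
        if ($data === array(null)) {
            continue; // fgetcsv() reports a blank line as array(null); skip it
        }
        $rows[] = $data;
    }

    fclose($handle);
    setlocale(LC_CTYPE, $previous); // restore the caller's locale
    return $rows;
}
```

In the question's code this would replace the fopen()/fgetcsv() loop, e.g. $cardinfo = read_utf8_csv($file->uri);. Restoring the previous locale is a design choice in this sketch: setlocale() affects the whole process, so leaving it changed could influence unrelated code.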

Upvotes: 3
