Reputation: 5204
I'm working on a CSV import script in PHP. It works fine, except that a non-ASCII character at the beginning of a field gets dropped.
The code looks like this:
if (($handle = fopen($filename, "r")) !== FALSE)
{
    while (($data = fgetcsv($handle, 1000, ",")) !== FALSE)
    {
        $teljing[] = $data;
    }
    fclose($handle);
}
Here is a data example showing my issue:
føroyskir stavir, "Kr. 201,50"
óvirkin ting, "Kr. 100,00"
This will result in the following:
array
(
[0] => array
(
[0] => 'føroyskir stavir',
[1] => 'Kr. 201,50'
)
[1] => array
(
[0] => 'virkin ting', <--- Should be 'óvirkin ting'
[1] => 'Kr. 100,00'
)
)
I have seen this behavior documented in some comments on php.net, and I have tried ini_set('auto_detect_line_endings', TRUE); to detect line endings, without success.
Anyone familiar with this issue?
Edit:
Thank you, AJ. This issue is now solved.
setlocale(LC_ALL, 'en_US.UTF-8');
was the solution.
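For anyone landing here later, this is roughly how the fix fits into the read loop. It is only a sketch: read_csv_utf8 is a made-up helper name, it assumes the file itself is UTF-8, and it assumes one of the listed UTF-8 locales is installed on the machine (setlocale() returns false when none of them is).

```php
<?php
// Minimal sketch: select a UTF-8 locale before calling fgetcsv().
// Assumptions: the CSV file is UTF-8 encoded and at least one of the
// locale names below exists on this system.

function read_csv_utf8(string $filename): array
{
    // setlocale() tries each name in turn and returns false if none is
    // available; in that case parsing continues with the current locale.
    $locale = setlocale(LC_CTYPE, 'en_US.UTF-8', 'en_US.utf8', 'C.UTF-8');

    $handle = fopen($filename, 'r');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $filename");
    }

    $rows = [];
    // Length 0 means "no line length limit"; delimiter, enclosure and
    // escape character are passed explicitly.
    while (($data = fgetcsv($handle, 0, ',', '"', '\\')) !== false) {
        $rows[] = $data;
    }
    fclose($handle);

    return $rows;
}
```

Setting only LC_CTYPE instead of LC_ALL keeps number and date formatting untouched; LC_ALL, as in the edit above, works as well.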
Upvotes: 5
Views: 3019
Reputation: 28184
From the PHP manual for fgetcsv():
"Note: Locale setting is taken into account by this function. If LANG is e.g. en_US.UTF-8, files in one-byte encoding are read wrong by this function."
Upvotes: 6
Reputation: 4778
Copied from the PHP.net/fgetcsv comments:
kent at marketruler dot com, 04-Feb-2010 11:18:
Note that fgetcsv, at least in PHP 5.3 or earlier, will NOT work with UTF-16 encoded files. Your options are to convert the entire file to ISO-8859-1 (latin1), or to convert each line to ISO-8859-1 and then use str_getcsv (or a backwards-compatible implementation of it). If you need to read non-Latin alphabets, it is probably best to convert to UTF-8.
See str_getcsv for a backwards-compatible version of it with PHP < 5.3, and see utf8_decode for a function written by Rasmus Andersson which provides utf16_decode. The modification I added deals with the fact that the BOM appears only at the top of the file, not on subsequent lines. So you need to store the endianness and pass it back in when decoding each subsequent line. This modified version reports the detected endianness through its second parameter:
<?php
/**
* Decode UTF-16 encoded strings.
*
* Can handle both BOM'ed data and un-BOM'ed data.
* Assumes Big-Endian byte order if no BOM is available.
* From: http://php.net/manual/en/function.utf8-decode.php
*
* @param string $str UTF-16 encoded data to decode.
* @return string ISO-8859-1 encoded data (code points above 0xFF are not preserved).
* @access public
* @version 0.1 / 2005-01-19
* @author Rasmus Andersson {@link http://rasmusandersson.se/}
* @package Groupies
*/
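function utf16_decode($str, &$be = null) {
    if (strlen($str) < 2) {
        return $str;
    }
    // Detect the byte order mark. $be is passed by reference so the caller
    // can reuse the detected endianness for later lines that carry no BOM.
    // ($str[0] replaces the $str{0} syntax, which PHP 8 no longer accepts.)
    $c0 = ord($str[0]);
    $c1 = ord($str[1]);
    $start = 0;
    if ($c0 == 0xFE && $c1 == 0xFF) {
        $be = true;
        $start = 2;
    } else if ($c0 == 0xFF && $c1 == 0xFE) {
        $be = false;
        $start = 2;
    }
    if ($be === null) {
        $be = true;
    }
    $len = strlen($str);
    $newstr = '';
    // Read two bytes per code unit; a dangling odd byte at the end is ignored.
    for ($i = $start; $i + 1 < $len; $i += 2) {
        // The high byte is shifted by 8 bits (the posted version shifted by 4).
        if ($be) {
            $val = (ord($str[$i]) << 8) + ord($str[$i + 1]);
        } else {
            $val = (ord($str[$i + 1]) << 8) + ord($str[$i]);
        }
        // U+2028 (line separator) becomes "\n". chr() keeps only the low
        // byte, so only code points up to 0xFF come through intact.
        $newstr .= ($val == 0x2028) ? "\n" : chr($val);
    }
    return $newstr;
}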
?>
Trying the "setlocale" trick did not work for me, e.g.
<?php
setlocale(LC_CTYPE, "en.UTF16");
$line = fgetcsv($file, ...)
?>
But that's perhaps because my platform didn't support it. However, fgetcsv only supports single characters for the delimiter, etc. and complains if you pass in a UTF-16 version of said character, so I gave up on that rather quickly.
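If the mbstring extension is available, the "convert to UTF-8, then use str_getcsv" route mentioned above could look something like the sketch below. This is not the code from the php.net comment: parse_utf16_csv is a hypothetical helper name, it falls back to big-endian when there is no BOM (the same default as utf16_decode above), and it splits on line breaks first, so it will mis-parse quoted fields that contain a line break.

```php
<?php
// Sketch: convert UTF-16 CSV data to UTF-8, then parse line by line.
// Assumptions: ext/mbstring is loaded, and no quoted field spans
// several lines.

function parse_utf16_csv(string $raw): array
{
    // Choose the byte order from the BOM and strip it; default to
    // big-endian when no BOM is present.
    $encoding = 'UTF-16BE';
    if (strncmp($raw, "\xFF\xFE", 2) === 0) {
        $encoding = 'UTF-16LE';
        $raw = substr($raw, 2);
    } elseif (strncmp($raw, "\xFE\xFF", 2) === 0) {
        $raw = substr($raw, 2);
    }

    $utf8 = mb_convert_encoding($raw, 'UTF-8', $encoding);

    $rows = [];
    foreach (preg_split('/\r\n|\r|\n/', $utf8) as $line) {
        if ($line === '') {
            continue; // skip blank lines, e.g. after a trailing newline
        }
        // After conversion the delimiter is a single byte again, so
        // str_getcsv() accepts it without complaint.
        $rows[] = str_getcsv($line, ',', '"', '\\');
    }

    return $rows;
}
```

Converting the whole file once means the endianness only has to be worked out a single time, instead of being carried from line to line.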
Hope this is helpful to someone out there.
Upvotes: 0