Reputation: 20456
I'm processing a text file output by a scientific instrument. I do not have documentation about how the file is produced. But I've discovered that it is full of invisible characters and characters that look normal but aren't. I read the file into an array and try to clean it up. Here's my process (showing only the first 4 lines of the file).
$datarr=file($_FILES['gcfile']['tmp_name'],FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
Result (note that the BOM and EOL are still there and the string lengths are too large):
array(4) {
[0]=>
string(168) "ÿþC:\CHEM32\2\DATA\20120120 DA KLR\20120120 DA KLR 2012-01-20 09-55-35\P21K1000001.D
"
[1]=>
string(33) "tryptophan + DA
"
[2]=>
string(55) "Number of Peaks found 2
"
[3]=>
string(63) "" 1 , 13.08 , 36.29 "
"
}
$datarr[0]=removeBOM($datarr[0]);//remove byte order mark at beginning of file
$options=array(FILTER_FLAG_STRIP_HIGH, FILTER_FLAG_STRIP_LOW);
$patterns=array('/\pC/','/\'/', '/\"/');
array_walk($datarr,function(&$v) use($options, $patterns){
$v=filter_var($v,FILTER_SANITIZE_STRING, $options);
$v=trim(preg_replace($patterns,'',$v));
});
Result (note that the double quotes on $datarr[3] are retained, the string lengths are now close to the visible lengths, and the BOM is gone):
array(4) {
[0]=>
string(82) "C:\CHEM32\2\DATA\20120120 DA KLR\20120120 DA KLR 2012-01-20 09-55-35\P21K1000001.D"
[1]=>
string(15) "tryptophan + DA"
[2]=>
string(26) "Number of Peaks found 2"
[3]=>
string(38) "" 1 , 13.08 , 36.29 ""
}
$datarr[3], though much improved, still has a reported length greater than its visible length, and the " marks weren't removed. If I output the string as ASCII codes:
$l=strlen($datarr[3]);
for($i=0;$i<$l;$i++){
echo ord($datarr[3][$i]), ", ";
}
echo PHP_EOL;
$x= '" 1 , 13.08 , 36.29 "
';//copied from webpage output
$l=strlen($x);
for($i=0;$i<$l;$i++){
echo ord($x[$i]), ", ";
}
This is what I get:
38, 35, 51, 52, 59, 32, 32, 49, 32, 32, 44, 32, 32, 49, 51, 46, 48, 56, 32, 44, 32, 32, 32, 32, 32, 32, 32, 51, 54, 46, 50, 57, 32, 38, 35, 51, 52, 59, //original string
34, 32, 32, 49, 32, 32, 44, 32, 32, 49, 51, 46, 48, 56, 32, 44, 32, 32, 32, 32, 32, 32, 32, 51, 54, 46, 50, 57, 32, 34, 10, 9, //pasted from browser string
What do I have and what can I do about it?
Upvotes: 1
Views: 85
Reputation: 14479
I think I see what's wrong here. Reversing your output:
$array1 = array(38, 35, 51, 52, 59, 32, 32, 49, 32, 32, 44, 32, 32, 49, 51, 46, 48, 56, 32, 44, 32, 32, 32, 32, 32, 32, 32, 51, 54, 46, 50, 57, 32, 38, 35, 51, 52, 59);
$array2 = array(34, 32, 32, 49, 32, 32, 44, 32, 32, 49, 51, 46, 48, 56, 32, 44, 32, 32, 32, 32, 32, 32, 32, 51, 54, 46, 50, 57, 32, 34, 10, 9);
foreach($array1 as $char){
echo chr($char);
}
echo PHP_EOL;
foreach($array2 as $char){
echo chr($char);
}
we get:
&#34;  1  ,  13.08 ,       36.29 &#34;
"  1  ,  13.08 ,       36.29 "
So clearly the issue is the encoding of the double quotes (hence the string(38) length when we expect string(30)). This stems from your filter_var() call: the FILTER_SANITIZE_STRING filter encodes quotes by default, turning each " into the five-character entity &#34;. To stop this from happening, add the FILTER_FLAG_NO_ENCODE_QUOTES flag to your options. Note that filter_var() expects flags as a bitmask (or under a 'flags' key in an options array) rather than as a plain list, so combine them with |. This should prevent the encoding of quotes and leave you with the expected string:
$options=FILTER_FLAG_NO_ENCODE_QUOTES | FILTER_FLAG_STRIP_HIGH | FILTER_FLAG_STRIP_LOW;
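If some strings have already been through the filter, a minimal sketch like the one below shows where the eight extra bytes come from and one way to undo them. It assumes the byte values are exactly those posted in the question. The variable names ($codes, $raw, $decoded, $clean) are only illustrative, and html_entity_decode() is a swapped-in technique here, not something from the original code.

```php
<?php
// Byte values copied from the question's dump of $datarr[3].
$codes = array(38, 35, 51, 52, 59, 32, 32, 49, 32, 32, 44, 32, 32, 49, 51, 46,
               48, 56, 32, 44, 32, 32, 32, 32, 32, 32, 32, 51, 54, 46, 50, 57,
               32, 38, 35, 51, 52, 59);

// Rebuild the string exactly as PHP holds it after filter_var().
$raw = implode('', array_map('chr', $codes));

// Turn numeric entities such as &#34; back into the literal characters.
$decoded = html_entity_decode($raw, ENT_QUOTES);

// Assumption: the goal is the bare values, so drop the quotes and outer spaces.
$clean = trim(str_replace('"', '', $decoded));

echo strlen($raw), PHP_EOL;     // length with both quotes encoded as &#34;
echo strlen($decoded), PHP_EOL; // length once the entities are decoded
echo $clean, PHP_EOL;
```

Each encoded quote costs five bytes instead of one, which accounts for the difference between string(38) and string(30). Decoding after the fact is a workaround; passing FILTER_FLAG_NO_ENCODE_QUOTES up front avoids the encoding in the first place.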
Upvotes: 2