Dr. Agon

Reputation: 87

Two question marks in diamonds instead of upside-down exclamation mark

I'm processing some text files with Spanish text in PHP with eclipse-php on my Mac OS X 10. I have the encoding set to UTF-8, and everything works great except for one small problem. All of the ¡ (upside-down exclamation marks) are replaced with � � (two black diamonds with question marks, separated by a space) in the output text file. None of the other characters (¿ñáéíóúü) are giving me any trouble. I had a similar problem on my Windows Vista machine (it would replace every ¡ with é). Any ideas why this one character is bugging out in UTF-8 and how I can fix it?

Here's the code I'm using. I didn't include it originally because it is so long and I'm not sure where the problem lies. As you can see I've tried to incorporate shiplu.mokadd.im's suggestion, but I'm still getting the � �.

<?php

ini_set("auto_detect_line_endings", true);

$sourceH = fopen("MainInput.txt", "r") or die("Can't open MainInput.txt.");
$sourceData = array();
$tracker = 0;

while (!feof($sourceH)){
    $sourceData[$tracker] = fgets($sourceH);
    $sourceData[$tracker] = preg_split("/\t/", $sourceData[$tracker]);
    $tracker++;
}

$i = $tracker--;

$chars_hi = 'ABCDEFGHIJKLMNÑOPQRSTUVWXYZÁÉÍÓÚÜ';
$chars_lo = 'abcdefghijklmnñopqrstuvwxyzáéíóúü';
$characters = "ABCDEFGHIJKLMNÑOPQRSTUVWXYZÁÉÍÓÚÜabcdefghijklmnñopqrstuvwxyzáéíóúü1234567890'-";

function lowercase($s) {
    global $chars_hi, $chars_lo;
    return strtr($s, $chars_hi, $chars_lo);
}

$myNewFile = "Processing/Prepared.txt";
$fhNew = fopen($myNewFile, 'w') or die("can't open Prepared\n");
$newText = "";

for ($n = 1; $n < $i; $n++) {

    $myFile = $sourceData[$n][1];
    $fh = fopen($myFile,'r') or die("can't open file ".$sourceData[$n][1]."\n");
    fwrite($fhNew, "\n\nStartFile ".$sourceData[$n][0]."\n\n");
    $position = 0;
    $speaker = ">>u";

    while (!feof($fh)){
        $newText = fgets($fh);
        $isLast = false;
        $isFirst = true;
        $new = "";
        if (mb_strpos($newText, ">> i") !== false or mb_strpos($newText, ">>i") !== false or mb_strpos($newText, ">i") !== false or mb_strpos($newText, "> i") !== false) {
            $speaker = ">>i";
        }
        elseif (mb_strpos($newText, ">> s") !== false or mb_strpos($newText, ">>s") !== false or mb_strpos($newText, ">s") !== false or mb_strpos($newText, "> s") !== false) {
            $speaker = ">>s";
        }
        for ($in = 0; $in < mb_strlen($newText); $in++) {
            if (mb_strpos($characters, $newText[$in]) !== false) {
                if ($isFirst == true) {
                    $new = $new." ".$newText[$in];
                    $isFirst = false;
                    $isLast = true;
                }
                else {
                    $new = $new.$newText[$in];
                }
            }
            elseif ($isLast == true) {
                $isLast = false;
                $isFirst = true;
                $new = $new."   ".($in + $position)."   ".$speaker."    ".$newText[$in];
            }
            else {
                $new = $new.$newText[$in];
            }
        }
        $position += mb_strlen($newText);
        $newText = $new;
        $newText = lowercase($newText);
        fwrite($fhNew, $newText."\n");
    }
    fclose($fh);
}
fclose($fhNew);

?>

Upvotes: 2

Views: 1491

Answers (1)

Esailija

Reputation: 140220

You cannot do stuff like this:

$new = $new." ".$newText[$in];

Specifically, $newText[$in]. That does byte-level access, but in UTF-8 a single character can consist of multiple bytes. So when you hack and slash bytes like this, you separate UTF-8 bytes that belong together, and each orphaned byte is displayed as � (the replacement character).

For example, run this PHP script (Saved in text editor as UTF-8):

<?php
header("Content-Type: text/html; charset=UTF-8");
$text = "ä";
echo $text[0] . " " . $text[1];

The result is � �.
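To see why a single character turns into two diamonds, it can help to look at the raw bytes. The following is a small sketch, assuming the script file itself is saved as UTF-8; the variable names are only for illustration.

```php
<?php
// "¡" is U+00A1, which UTF-8 encodes as the two bytes 0xC2 0xA1.
$char = "¡";

$byteCount = strlen($char);   // strlen() counts bytes, not characters
$hex = bin2hex($char);        // hex dump of the whole string

echo $byteCount, "\n";        // 2
echo $hex, "\n";              // c2a1

// [] access returns one byte at a time; neither byte is valid UTF-8 on its own.
echo bin2hex($char[0]), " ", bin2hex($char[1]), "\n"; // c2 a1
```

Writing anything between those two bytes (a space, a position number, a speaker tag) leaves two invalid sequences behind, which is consistent with the � � seen in the output file.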

You must fix all of your code where you are doing [] access on strings. You can replace $string[$i] with mb_substr( $string, $i, 1, "UTF-8" );
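As a rough sketch of that replacement, the loop below walks a string character by character instead of byte by byte. It assumes the mbstring extension is loaded and the file is saved as UTF-8; utf8_chars() is a hypothetical helper name, not something from the question.

```php
<?php
// Hypothetical helper: split a UTF-8 string into an array of characters.
function utf8_chars($s)
{
    $chars = array();
    $length = mb_strlen($s, "UTF-8");          // length in characters
    for ($i = 0; $i < $length; $i++) {
        // mb_substr() returns the whole character, however many bytes it uses
        $chars[] = mb_substr($s, $i, 1, "UTF-8");
    }
    return $chars;
}

$text = "¡Hola!";
echo strlen($text), "\n";                      // 7 (bytes)
echo mb_strlen($text, "UTF-8"), "\n";          // 6 (characters)
echo implode(" ", utf8_chars($text)), "\n";    // ¡ H o l a !
```

In the question's inner loop, the same idea would mean reading the current character once with mb_substr() and using that value wherever $newText[$in] appears.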

Also, have you set mb_internal_encoding to "UTF-8"? Otherwise it will most likely not default to UTF-8 when you call mb_* functions without explicit encoding.
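A minimal sketch of setting it once near the top of the script, so that later mb_* calls without an explicit encoding argument treat strings as UTF-8 (assuming the mbstring extension is available):

```php
<?php
// Set the default encoding for all mb_* functions in this script.
mb_internal_encoding("UTF-8");

$text = "¡ñ";
$encoding = mb_internal_encoding();   // called without arguments, returns the current setting
$length = mb_strlen($text);           // no encoding argument needed now
$first = mb_substr($text, 0, 1);

echo $encoding, "\n";                 // UTF-8
echo $length, "\n";                   // 2
echo $first, "\n";                    // ¡
```

Passing "UTF-8" explicitly to each call, as in the mb_substr example above, works as well and does not depend on this setting.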

I also recommend using something like mb_convert_case($str, MB_CASE_LOWER, "UTF-8"); over your custom lowercase function.
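The three-argument form of strtr() translates byte by byte, so it is not a safe fit for multi-byte characters. A possible drop-in replacement might look like the sketch below; lowercase_utf8() is a hypothetical name, and the sample string is made up for illustration.

```php
<?php
// Hypothetical replacement for the question's lowercase() function.
function lowercase_utf8($s)
{
    // Lowercases accented letters too and leaves punctuation such as ¡ untouched.
    return mb_convert_case($s, MB_CASE_LOWER, "UTF-8");
}

$line = "¡ÑANDÚ ÁÉÍÓÚÜ!";
$lowered = lowercase_utf8($line);
echo $lowered, "\n";   // ¡ñandú áéíóúü!
```

This also removes the need for the $chars_hi and $chars_lo lookup strings.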

Upvotes: 5
