Julio de Leon

Reputation: 1212

How to convert a file to UTF-8 in php?

Is it possible to convert a file into UTF-8 on my end?

I have access to the file after submission through

$_FILES['file']['tmp_name']

Note: The user can upload a CSV file in any charset; I usually encounter an unknown 8-bit charset.

I tried:

$row = array();
$datas = file($_FILES['file']['tmp_name']);
foreach($datas as $data) {
    $data = mb_convert_encoding($data, 'UTF-8');
    $row[] = explode(',', $data);
}

The problem is that this code removes special characters such as the single quote.

My first question is: does htmlspecialchars remove the values inside the array?

I mention it as additional information. Thanks to anyone who can help!

Upvotes: 10

Views: 21319

Answers (5)

Juergen Schulze

Reputation: 1652

function convert_file_to_utf8($source, $target) {
    $content = file_get_contents($source);
    # detect original encoding (strict mode)
    $original_encoding = mb_detect_encoding($content, "UTF-8, ISO-8859-1, ISO-8859-15", true);
    # convert only when the content is not already UTF-8
    if ($original_encoding != 'UTF-8') {
        $content = mb_convert_encoding($content, 'UTF-8', $original_encoding);
    }
    $bom = chr(239) . chr(187) . chr(191); # use BOM to be on safe side
    # avoid writing a second BOM when the content already starts with one
    if (strncmp($content, $bom, 3) !== 0) {
        $content = $bom . $content;
    }
    file_put_contents($target, $content);
}

Upvotes: 6

odan

Reputation: 4952

Let's try this:

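function encode_utf8($data, $from = 'ISO-8859-1')
{
    if ($data === null || $data === '') {
        return $data;
    }
    if (mb_check_encoding($data, 'UTF-8')) {
        return $data;
    }
    // $from is an assumed source encoding; adjust it to match your files.
    // Without it, mb_convert_encoding() assumes the internal encoding
    // (usually UTF-8) and replaces the offending bytes with "?".
    return mb_convert_encoding($data, 'UTF-8', $from);
}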

Usage:

$content = file_get_contents($_FILES['file']['tmp_name']);
$content = encode_utf8($content);

$rows = explode("\n", $content);
foreach ($rows as $row) {
    print_r($row);
}
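
A guess at why the single quotes vanish in the question's code: when no source encoding is passed, mb_convert_encoding assumes the input is already in the internal encoding (usually UTF-8), so a byte such as the Windows-1252 curly apostrophe (0x92) is invalid and gets replaced. Below is a minimal sketch, assuming the upload is Windows-1252 whenever it is not valid UTF-8. The helper name csv_rows_to_utf8 and the $assumedSource default are illustrative and not taken from any answer here. It parses with str_getcsv so that quoted fields containing commas survive.

```php
<?php
// Hypothetical helper: convert raw CSV bytes to UTF-8, then parse each line.
// $assumedSource is an assumption; change it to whatever your uploads really use.
function csv_rows_to_utf8(string $raw, string $assumedSource = 'Windows-1252'): array
{
    // Leave valid UTF-8 alone; otherwise convert from the assumed encoding.
    if (!mb_check_encoding($raw, 'UTF-8')) {
        $raw = mb_convert_encoding($raw, 'UTF-8', $assumedSource);
    }

    $rows = [];
    foreach (preg_split('/\r\n|\r|\n/', $raw) as $line) {
        if ($line === '') {
            continue; // skip blank lines, including the one after a trailing newline
        }
        // str_getcsv honours quoted fields, unlike explode(',', ...).
        $rows[] = str_getcsv($line, ',', '"', '\\');
    }
    return $rows;
}
```

Usage would be something like $rows = csv_rows_to_utf8(file_get_contents($_FILES['file']['tmp_name']));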

Upvotes: 2

hanshenrik

Reputation: 21483

Before you can convert it to UTF-8, you need to know which character set it is in. If you can't figure that out, you can't convert it to UTF-8 in any sane way. However, an insane way to convert it, if the encoding cannot be determined, is to simply strip any bytes that aren't valid in UTF-8; you might be able to use that as a fallback.

Warning: untested code (I'm suddenly in a hurry), but it may look something like this:

foreach ( $datas as $data ) {
    $encoding = guess_encoding ( $data );
    if (empty ( $encoding )) {
        // encoding cannot be determined...
        // as a fallback, we simply strip any bytes that aren't valid utf-8...
        // obviously this isn't a reliable conversion scheme.
        // also this could probably be improved
        $data = iconv ( "ASCII", "UTF-8//TRANSLIT//IGNORE", $data );
    } else {
        $data = mb_convert_encoding ( $data, 'UTF-8', $encoding );
    }
    $row [] = explode ( ',', $data );
}
function guess_encoding(string $str): string {
    $blacklist = array (
            'pass',
            'auto',
            'wchar',
            'byte2be',
            'byte2le',
            'byte4be',
            'byte4le',
            'BASE64',
            'UUENCODE',
            'HTML-ENTITIES',
            '7bit',
            '8bit' 
    );
    $encodings = array_flip ( mb_list_encodings () );
    foreach ( $blacklist as $tmp ) {
        unset ( $encodings [$tmp] );
    }
    $encodings = array_keys ( $encodings );
    $detected = mb_detect_encoding ( $str, $encodings, true );
    return ( string ) $detected;
}
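
For the strip-invalid-bytes fallback described above, one alternative that avoids iconv is to tell mbstring to drop unconvertible characters and then convert UTF-8 to UTF-8. This is only a sketch: strip_invalid_utf8 is a made-up name, and like any stripping approach it loses data rather than converting it.

```php
<?php
// Hypothetical helper: drop byte sequences that are not valid UTF-8.
function strip_invalid_utf8(string $bytes): string
{
    // Remember the current substitute character so it can be restored.
    $previous = mb_substitute_character();
    // 'none' makes mbstring omit invalid input instead of emitting "?".
    mb_substitute_character('none');
    try {
        return mb_convert_encoding($bytes, 'UTF-8', 'UTF-8');
    } finally {
        mb_substitute_character($previous);
    }
}
```

Valid UTF-8 passes through unchanged, so this is only worth calling when encoding detection has already failed.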

Upvotes: 3

A.D.

Reputation: 2372

You can convert the file text into binary data using the following:

function bin2text($bin_str)
{
    $text_str = '';
    // Drop newlines, then read the bit string 8 bits (one byte) at a time.
    $chars = str_split(str_replace("\n", '', $bin_str), 8);
    foreach ($chars as $char) {
        $text_str .= chr(bindec($char));
    }
    return $text_str;
}

function text2bin($txt_str)
{
    $len = strlen($txt_str);
    $bin = '';
    for ($i = 0; $i < $len; $i++) {
        // Left-pad each byte's binary form to 8 bits.
        $bin .= str_pad(decbin(ord($txt_str[$i])), 8, '0', STR_PAD_LEFT);
    }
    return $bin;
}

After converting the data to binary, you simply pass the text to the PHP function mb_convert_encoding($fileText, "UTF-8");

Upvotes: 1

BritishWerewolf

Reputation: 3968

Try this out.
The example below is something I was doing in a test environment, so you might need to change the code slightly.

I had a text file containing the following data:

test
café
áÁÁÁááá
žžœš¥±
ÆÆÖÖÖasØØ
ß

Then I had a form that took a file input and ran the following code:

function neatify_files(&$files) {
    $tmp = array();
    for ($i = 0; $i < count($_FILES); $i++) {
        for ($j = 0; $j < count($_FILES[array_keys($_FILES)[$i]]["name"]); $j++) {
            $tmp[array_keys($_FILES)[$i]][$j]["name"] = $_FILES[array_keys($_FILES)[$i]]["name"][$j];
            $tmp[array_keys($_FILES)[$i]][$j]["type"] = $_FILES[array_keys($_FILES)[$i]]["type"][$j];
            $tmp[array_keys($_FILES)[$i]][$j]["tmp_name"] = $_FILES[array_keys($_FILES)[$i]]["tmp_name"][$j];
            $tmp[array_keys($_FILES)[$i]][$j]["error"] = $_FILES[array_keys($_FILES)[$i]]["error"][$j];
            $tmp[array_keys($_FILES)[$i]][$j]["size"] = $_FILES[array_keys($_FILES)[$i]]["size"][$j];
        }
    }
    return $files = $tmp;
}

if (isset($_POST["submit"])) {
    neatify_files($_FILES);
    $file = $_FILES["file"][0];

    $handle = fopen($file["tmp_name"], "r");
    while (($line = fgets($handle)) !== false) {
        $enc = mb_detect_encoding($line, "UTF-8", true);
        if (strtolower($enc) != "utf-8") {
            echo "<p>" . (iconv($enc, "UTF-8", $line)) . "</p>";
        } else {
            echo "<p>$line</p>";
        }
    }
}
?>
<form action="<?= $_SERVER["PHP_SELF"]; ?>" method="POST" enctype="multipart/form-data">
    <input type="file" name="file[]" />
    <input type="submit" name="submit" value="Submit" />
</form>

The function neatify_files is something I wrote to make the $_FILES array more logical in its layout.

The form is a standard form that simply POSTs the data to the server.
Note: Using $_SERVER["PHP_SELF"] is a security risk, see here for more.
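
On the PHP_SELF point: the risk is that the value can contain attacker-supplied path text that is echoed into the page unescaped. One common mitigation, sketched here with a hypothetical helper named safe_form_action, is to pass the value through htmlspecialchars before printing it.

```php
<?php
// Hypothetical helper: escape a request path before echoing it into an attribute.
function safe_form_action(string $phpSelf): string
{
    // ENT_QUOTES escapes both double and single quotes, so the value
    // cannot break out of the action="..." attribute.
    return htmlspecialchars($phpSelf, ENT_QUOTES, 'UTF-8');
}
```

In the form above this would be used as action="<?= safe_form_action($_SERVER["PHP_SELF"]); ?>".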

When the data is posted I store the file in a variable. Obviously, if you are using the multiple attribute your code won't look quite like this.

$handle is a file handle for the uploaded text file, opened read-only; hence the "r" argument.

$enc uses the mb_detect_encoding function to detect the encoding (duh).
At first I was having trouble obtaining the correct encoding; setting the encoding_list to use only UTF-8 and setting strict to true sorted that out.

If the encoding is UTF-8 then I simply print the line; if it isn't, I convert it to UTF-8 using the iconv function.
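
To illustrate the detection step: with encoding_list set to only UTF-8 and strict set to true, mb_detect_encoding returns either "UTF-8" or false, so it cannot name the source encoding of a non-UTF-8 line. Here is a small sketch that makes the fallback explicit. The name line_to_utf8 and the ISO-8859-1 default are assumptions for illustration and are not part of the code above.

```php
<?php
// Hypothetical helper: return $line as UTF-8, assuming $fallback for anything else.
function line_to_utf8(string $line, string $fallback = 'ISO-8859-1'): string
{
    // Strict detection against a single candidate yields "UTF-8" or false.
    $enc = mb_detect_encoding($line, 'UTF-8', true);
    if ($enc === 'UTF-8') {
        return $line;
    }
    // Detection failed, so the source encoding is a guess ($fallback).
    $converted = iconv($fallback, 'UTF-8', $line);
    if ($converted === false) {
        throw new RuntimeException("Could not convert line from $fallback");
    }
    return $converted;
}
```

If the uploads come from a known source, passing that encoding as $fallback is more reliable than any guess.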

Upvotes: 2
