Michael Teper

Reputation: 4651

How to read in UTF8+BOM file using PHP and not have the BOM appear as content?

Pretty much what the question says. I've found lots of recommendations for how to strip the byte order mark once the text is read in, but that seems wrong. Isn't there a standard way in the language to read in a Unicode file with proper recognition and treatment of the BOM?

Upvotes: 4

Views: 2807

Answers (2)

iProDev

Reputation: 603

I had the same problem. My function _fread() removes the BOM, which solved the issue for me:

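/**
 * Read a local file and strip a leading UTF-8 BOM, if present
 * @param   string $file   local filename
 * @return  string|false   data from file without the BOM, or false on failure
 */
function _fread ($file = null) {
    if ( $file === null || !is_readable($file) ) return false;

    // file_get_contents() also copes with empty files,
    // where fread($fh, filesize($file)) would fail on a zero length
    $data = file_get_contents($file);
    if ( $data === false ) return false;

    // remove bom (bytes EF BB BF) only when it starts the data
    $bom = "\xEF\xBB\xBF";
    if ( strncmp($data, $bom, 3) === 0 ) {
        $data = (string) substr($data, 3);
    }
    return $data;
}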

Upvotes: 4

bobince

Reputation: 536389

Nope. You have to do it manually.

The BOM signals byte order in UTF-16, telling a decoder whether the data is little-endian (UTF-16LE) or big-endian (UTF-16BE), so it makes some sense for UTF-16 decoders to remove it automatically (and many do).
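
To make the byte-order signalling concrete, here is a minimal sketch that inspects the leading bytes of a string and reports which encoding the BOM suggests. The function name `detect_bom` is hypothetical, not a PHP built-in, and the check is deliberately naive: a UTF-32LE BOM (FF FE 00 00) begins with the same two bytes as the UTF-16LE one, so this would misreport it.

```php
<?php
// Hypothetical helper (not part of PHP): guess the encoding from a leading BOM.
// Returns 'UTF-8', 'UTF-16LE', 'UTF-16BE', or null when no BOM is found.
function detect_bom(string $bytes): ?string {
    if (strncmp($bytes, "\xEF\xBB\xBF", 3) === 0) {
        return 'UTF-8';      // the UTF-8 signature, carries no byte-order information
    }
    if (strncmp($bytes, "\xFF\xFE", 2) === 0) {
        return 'UTF-16LE';   // U+FEFF serialized low byte first
    }
    if (strncmp($bytes, "\xFE\xFF", 2) === 0) {
        return 'UTF-16BE';   // U+FEFF serialized high byte first
    }
    return null;
}
```

Only the two UTF-16 cases tell the decoder anything about byte order; the UTF-8 case merely marks the data as probably UTF-8.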

However, UTF-8 always has the same byte order and aims for ASCII compatibility, so a BOM was never envisaged as part of the encoding scheme as specified, and UTF-8 decoders aren't really supposed to give it any special treatment.

The UTF-8 faux-BOM is not part of the encoding, but an ad hoc (and somewhat controversial) marker some (predominantly Microsoft) applications use to signal that the file is probably UTF-8. It's not a standard in itself, so specifications that build on UTF-8, like XML and JSON, have had to make special dispensation for it.
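
Since the stripping has to be done by hand, one option is to do it once, when the file is opened, rather than on the data after it has been read. The following is a minimal sketch under that assumption; `fopen_skip_bom` is a hypothetical name, not something PHP provides, and it only handles the three-byte UTF-8 signature.

```php
<?php
// Hypothetical helper (not a PHP built-in): open a file for reading and leave
// the handle positioned just past a leading UTF-8 BOM, if one is present.
// Returns a stream resource, or false if the file cannot be opened.
function fopen_skip_bom(string $path) {
    $fh = fopen($path, 'rb');
    if ($fh === false) {
        return false;
    }
    // Peek at the first three bytes; if they are not EF BB BF,
    // go back to the start so no real content is lost.
    if (fread($fh, 3) !== "\xEF\xBB\xBF") {
        rewind($fh);
    }
    return $fh;
}
```

The returned handle can then be passed to fgets() or fgetcsv() as usual, so the rest of the reading code never sees the BOM.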

Upvotes: 5
