Richard

Reputation: 7433

fgets() on a UTF-8 txt file returns rubbish letters (and a truthy value) when the file is blank

I assume this is due to the UTF-8 txt file format. The txt file is completely empty, yet when I call fgets($file_handle) I get these rubbish letters:

[Image: these weird letters]

How do I fix this? I want to check if the file is empty by using:

if (!$file_data = fgets($file_handle)) {
    // This code runs if the file is empty
}

EDIT

This is a new file saved with UTF-8 encoding:

[Image: new file]

Upvotes: 0

Views: 372

Answers (1)

Bananaapple

Reputation: 3114

This has to do with the BOM (Byte Order Mark) added by Notepad to detect the encoding:

Microsoft compilers and interpreters, and many pieces of software on Microsoft Windows such as Notepad treat the BOM as a required magic number rather than use heuristics. These tools add a BOM when saving text as UTF-8, and cannot interpret UTF-8 unless the BOM is present or the file contains only ASCII. Google Docs also adds a BOM when converting a document to a plain text file for download.

From this article you can also see that:

The UTF-8 representation of the BOM is the (hexadecimal) byte sequence 0xEF,0xBB,0xBF
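
One way to confirm that the stray characters really are these three bytes is to dump the start of the file as hex. This is only a minimal sketch; the helper name `first_bytes_hex` is made up for illustration and is not part of PHP:

```php
<?php
// Hypothetical helper: return the first $count bytes of a file as lowercase hex.
// Returns "" if the file cannot be opened or has no bytes to read.
function first_bytes_hex(string $filename, int $count = 3): string
{
    $handle = fopen($filename, 'rb');
    if ($handle === false) {
        return '';
    }

    $bytes = fread($handle, $count);
    fclose($handle);

    return $bytes === false ? '' : bin2hex($bytes);
}
```

For an "empty" file that Notepad saved as UTF-8, this should give `efbbbf`, which matches the BOM bytes quoted above.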

We should therefore be able to write a PHP function to account for this:

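function is_utf8_file_empty($filename)
{
    $file = fopen($filename, "rb");
    if ($file === false) {
        return false; // the file could not be opened
    }

    // Read at most 4 bytes: enough to tell "BOM only" from "BOM plus content"
    $head = fread($file, 4);
    fclose($file);

    // Empty means zero bytes, or nothing but the UTF-8 BOM
    return $head === "" || $head === "\xEF\xBB\xBF";
}

Do be aware that this is specific to files created in the manner you described, and that it is just example code. It reads only the first few bytes, so large files and completely empty (zero-byte) files are covered, but you should definitely test it and possibly modify it for your own situation.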
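
If you would rather keep the `fgets()` check from the question, another option is to strip a leading BOM from the first line before testing it. This is only a sketch: `read_first_line_without_bom` is a made-up name, and it assumes the UTF-8 BOM is the only one you need to handle.

```php
<?php
// Hypothetical helper: read the first line from an open file handle and
// drop a leading UTF-8 BOM (EF BB BF) if one is present.
// Returns "" when the file holds nothing besides the BOM.
function read_first_line_without_bom($file_handle): string
{
    $line = fgets($file_handle);
    if ($line === false) {
        return ''; // zero-byte file or read error
    }

    // Compare only the first three bytes against the BOM
    if (strncmp($line, "\xEF\xBB\xBF", 3) === 0) {
        $line = (string) substr($line, 3);
    }

    return $line;
}
```

You could then test `read_first_line_without_bom($file_handle) === ''` in place of the original condition. Comparing against `''` rather than using `!` also avoids treating a first line of `"0"` as empty.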

Upvotes: 2
