testing

Reputation: 20279

Check if csv file is in UTF-8 with PHP

Is there a way to check whether a CSV file is encoded as UTF-8 without a BOM? I want to check the whole file, not a single string.

I would try putting a special character in the first line, then reading that line and checking whether it matches the same string hard-coded in my script. But I don't know if this is a good idea.

Google only showed me this. But the link in the last post isn't available.

Upvotes: 1

Views: 11815

Answers (2)

Damien

Reputation: 5882

I recommend this function (from the symfony toolkit):

<?php
  /**
   * Checks if a string is an utf8.
   *
   * Yi Stone Li<[email protected]>
   * Copyright (c) 2007 Yahoo! Inc. All rights reserved.
   * Licensed under the BSD open source license
   *
   * @param string
   *
   * @return bool true if $string is valid UTF-8 and false otherwise.
   */
  public static function isUTF8($string)
  {
    for ($idx = 0, $strlen = strlen($string); $idx < $strlen; $idx++)
    {
      $byte = ord($string[$idx]);

      if ($byte & 0x80)
      {
        if (($byte & 0xE0) == 0xC0)
        {
          // 2 byte char
          $bytes_remaining = 1;
        }
        else if (($byte & 0xF0) == 0xE0)
        {
          // 3 byte char
          $bytes_remaining = 2;
        }
        else if (($byte & 0xF8) == 0xF0)
        {
          // 4 byte char
          $bytes_remaining = 3;
        }
        else
        {
          return false;
        }

        if ($idx + $bytes_remaining >= $strlen)
        {
          return false;
        }

        while ($bytes_remaining--)
        {
          if ((ord($string[++$idx]) & 0xC0) != 0x80)
          {
            return false;
          }
        }
      }
    }

    return true;
  }

But since it checks every character of the string, I don't recommend using it on a large file. Just check the first 10 lines, e.g.:

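<?php
$handle = fopen("mycsv.csv", "r");
$check_string = "";
$line = 1;
if ($handle) {
    // Read at most the first 10 lines; test the counter first so an 11th line is not read and discarded.
    while ($line < 11 && ($buffer = fgets($handle)) !== false) {
        $check_string .= $buffer;
        $line++;
    }
    // Report an error only if fgets() failed before reaching the line limit or the end of the file.
    if ($line < 11 && !feof($handle)) {
        echo "Error: unexpected fgets() fail\n";
    }
    fclose($handle);

    // Assumes this code runs inside the class that defines isUTF8().
    var_dump( self::isUTF8($check_string) );
}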

Upvotes: 5

deceze

Reputation: 522135

if (mb_check_encoding(file_get_contents($file), 'UTF-8')) {
    // yup, all UTF-8
}

You can also go through it line by line with fgets, if the file is large and you don't want to store it all in memory at once. Not sure what you mean by the second part of your question.
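
The line-by-line variant could look roughly like the sketch below. It is only an illustration under a few assumptions: the helper name isUtf8WithoutBom() is made up for this example, the mbstring extension is available, and a file counts as "without BOM" when it does not start with the bytes EF BB BF. Splitting at newlines is safe here because the byte 0x0A never occurs inside a multi-byte UTF-8 sequence, so a valid file always yields valid lines.

```php
<?php
// Hypothetical helper (not part of PHP or of the answer above): returns true
// when the file is valid UTF-8 and does not start with a byte order mark.
function isUtf8WithoutBom(string $path): bool
{
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path");
    }

    try {
        $first = true;
        while (($line = fgets($handle)) !== false) {
            // The UTF-8 BOM is the byte sequence EF BB BF at the very start of the file.
            if ($first && strncmp($line, "\xEF\xBB\xBF", 3) === 0) {
                return false;
            }
            $first = false;

            // Validate one line at a time so the whole file never sits in memory.
            if (!mb_check_encoding($line, 'UTF-8')) {
                return false;
            }
        }
        return true;
    } finally {
        // Runs on every exit path, including the early returns above.
        fclose($handle);
    }
}
```

Called as isUtf8WithoutBom("mycsv.csv"), it stops at the first invalid line, so a badly encoded file is usually rejected without being read to the end.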

Upvotes: 13
