Sébastien MICHOY

Reputation: 141

Encoding UTF-8 characters in QR Code symbols

Since iOS 7, it is possible to generate a QR Code via the CIFilter named CIQRCodeGenerator of the Core Image framework.

According to the documentation, Apple indicates that strings used to generate a QR Code must be encoded with NSISOLatin1StringEncoding:

"To create a QR code from a string or URL, convert it to an NSData object using the NSISOLatin1StringEncoding string encoding."

However, I tried to encode Chinese characters with NSUTF8StringEncoding and it works pretty well. Could using NSUTF8StringEncoding cause problems? Are there any known issues?

Upvotes: 4

Views: 8404

Answers (2)

Terry Burton

Reputation: 3030

What follows is general advice, and somebody who has knowledge of the Core Image framework may be able to provide a more specific answer. Nevertheless, I hope it clarifies why the library provides such specific encoding advice, the likely consequences of ignoring that advice, and how you might nevertheless encode characters that are not available in Latin-1.

In general, the ISO/IEC 18004 standard for QR Code ("QR Code 2005"), and all other international standards for 2D barcodes, specify that the Latin-1 character encoding must be used when interpreting the QR Code byte sequence returned by readers, except where an Extended Channel Interpretation (ECI) sequence specifying an alternative character encoding has been provided in the data.

It is, however, so common for users to encode the data using UTF-8 that in practice most barcode readers use a proprietary heuristic to guess whether the content is in some encoding other than Latin-1, such as UTF-8. In many cases this leads to ambiguity and will result in misreads, especially when arbitrary data is used in open applications.
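To illustrate the ambiguity, here is a minimal Python sketch. The function `looks_like_utf8` is a hypothetical stand-in for a reader's heuristic, not the algorithm used by any particular reader. It shows that one byte sequence can be both a valid Latin-1 message and a valid UTF-8 message, so a heuristic cannot know which one the producer meant.

```python
def looks_like_utf8(payload: bytes) -> bool:
    """Naive stand-in for a reader heuristic: is the payload valid UTF-8?"""
    try:
        payload.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# A producer following the standard encodes the two Latin-1 characters
# U+00C3 (A with tilde) and U+00A9 (copyright sign).
latin1_payload = "\u00c3\u00a9".encode("latin-1")

# A producer ignoring the standard encodes the single character
# U+00E9 (e with acute) as UTF-8.
utf8_payload = "\u00e9".encode("utf-8")

# Both producers end up with exactly the same two bytes: C3 A9.
same_bytes = latin1_payload == utf8_payload

# The reader sees only bytes, so both interpretations are plausible.
as_latin1 = latin1_payload.decode("latin-1")
as_utf8 = latin1_payload.decode("utf-8")
heuristic_says_utf8 = looks_like_utf8(latin1_payload)
```

A reader that trusts the heuristic would display the first producer's message incorrectly, even though that producer followed the standard.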

If you intend to be rigorous and it is required that the data be encoded using UTF-8 then it is necessary for the encoding library to support setting ECI 000026 before the UTF-8 data.
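As a rough sketch of what "setting ECI 000026 before the UTF-8 data" means at the bit level, the Python below builds the bit string for an ECI header followed by one byte-mode segment. The function names are hypothetical, and the sketch assumes a small symbol (versions 1 to 9, where the byte-mode character count is 8 bits wide). It leaves out the terminator, padding and error correction that a real encoder must add.

```python
def eci_designator_bits(assignment: int) -> str:
    """Encode an ECI assignment number in 1, 2 or 3 bytes."""
    if not 0 <= assignment <= 999999:
        raise ValueError("ECI assignment number out of range")
    if assignment <= 127:
        return format(assignment, "08b")          # 0bbbbbbb
    if assignment <= 16383:
        return "10" + format(assignment, "014b")  # 10bbbbbb bbbbbbbb
    return "110" + format(assignment, "021b")     # 110bbbbb + 2 more bytes


def byte_segment_with_eci(text: str, assignment: int = 26,
                          codec: str = "utf-8", count_bits: int = 8) -> str:
    """Return the bits for: ECI header + byte-mode header + data bytes."""
    data = text.encode(codec)
    if len(data) >= 1 << count_bits:
        raise ValueError("data too long for the assumed count field")
    bits = "0111" + eci_designator_bits(assignment)    # ECI mode indicator
    bits += "0100" + format(len(data), f"0{count_bits}b")  # byte mode + count
    bits += "".join(format(b, "08b") for b in data)    # the payload itself
    return bits


# U+00E9 (e with acute) encoded as UTF-8 and announced with ECI 000026.
segment = byte_segment_with_eci("\u00e9")
```

Without the leading `0111` header, a standards-compliant reader is entitled to interpret the same data bytes as Latin-1.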

Edit 2020: I have produced a detailed article describing precisely this issue and the work that is currently being undertaken by the standards bodies to promote the use of ECI: https://www.linkedin.com/pulse/enhanced-channel-interpretation-terry-burton/

The register of assigned ECI codes is available from the AIM store as "ECI Part 3: Register" for a fee.

[*] CIQRCodeGenerator does not appear to support setting an ECI.

Upvotes: 4

Maxim Masiutin

Reputation: 4782

“ISO-8859-1” is the default encoding for QR codes.

There are 4 modes of storing text in a QR Code:

  • numeric (0-9);
  • alphanumeric (numeric plus uppercase A-Z, space and eight punctuation characters) – 45 characters in total;
  • 8-bit (by default, “ISO-8859-1” encoded text);
  • Kanji (“Shift_JIS” encoded JIS X 0208 characters in the ranges 0x8140-0x9FFC and 0xE040-0xEBBF).
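The four modes above can be sketched as a simple classifier in Python. The function names and the "byte+eci" label are hypothetical, and the sketch picks a single mode for the whole string, whereas a real encoder may split the text into segments with different modes. It also assumes Python's built-in "shift_jis" codec is an acceptable approximation of the Shift_JIS mapping used for Kanji mode.

```python
DIGITS = set("0123456789")
# Digits, uppercase letters, space and eight punctuation characters.
ALPHANUMERIC = set("0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ $%*+-./:")


def is_kanji_mode_char(ch: str) -> bool:
    """True if ch is a double-byte Shift_JIS character in the Kanji ranges."""
    try:
        raw = ch.encode("shift_jis")
    except UnicodeEncodeError:
        return False
    if len(raw) != 2:
        return False
    value = int.from_bytes(raw, "big")
    return 0x8140 <= value <= 0x9FFC or 0xE040 <= value <= 0xEBBF


def pick_mode(text: str) -> str:
    """Pick one mode for the whole string, most compact first."""
    if text and all(ch in DIGITS for ch in text):
        return "numeric"
    if text and all(ch in ALPHANUMERIC for ch in text):
        return "alphanumeric"
    try:
        text.encode("latin-1")
        return "byte"            # fits the default 8-bit interpretation
    except UnicodeEncodeError:
        pass
    if all(is_kanji_mode_char(ch) for ch in text):
        return "kanji"
    return "byte+eci"            # needs an ECI, e.g. 000026 for UTF-8
```

For example, `pick_mode("HELLO WORLD")` gives "alphanumeric", while a Korean string falls through to "byte+eci" because it fits neither Latin-1 nor the Kanji ranges.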

That’s why NSISOLatin1StringEncoding is used for CIQRCodeGenerator – to accommodate the 8-bit text encoding mode in the QR Code, since “ISO-8859-1” is the default encoding in QR code.

To use UTF-8 encoding instead of the default “ISO-8859-1” in the 8-bit string, the implementation has to insert an ECI (Extended Channel Interpretation) before the string.

ECI is an optional, additional feature for a QR Code. ECI enables data encoding using character sets other than the default. It also enables other data interpretations (e.g. compacted data using defined compression schemes) or other industry-specific requirements to be encoded.

The ECI protocol is fully defined in the AIM ECI specification (developed by AIM, Inc - 20399 Route 19, Suite 203, Cranberry Township, Pennsylvania 16066 USA). It is a separate specification from the QR Code specification. The ECI protocol provides a method to specify particular interpretations of byte values before printing and after decoding. The specification is available for $50 at https://www.aimglobal.org/technical-symbology.html

Unfortunately, not all QR decoder implementations can handle the ECI protocol, even in such a basic thing as changing default encoding to UTF-8. Most implementations use one or another character encoding detection algorithm for guessing the encoding, even if the encoding is specified explicitly in the ECI of the decoded QR code.

The need for a detection algorithm probably arose because the initial QR code standard published in 2000 (ISO/IEC 18004:2000) specified the 8-bit Latin/Kana character set in accordance with JIS X 0201 (JIS8) as the default encoding for 8-bit mode, while the updated standard published in 2005 changed the default to ISO-8859-1.
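A small Python sketch shows why the change of default matters. Python has no dedicated JIS X 0201 codec, so this uses the built-in "shift_jis" codec as a stand-in, on the assumption that it decodes single bytes in the 0xA1-0xDF range as JIS X 0201 half-width katakana.

```python
# Three bytes from the upper half of the 8-bit range.
payload = bytes([0xB1, 0xB2, 0xB3])

# Under the 2005 default (ISO-8859-1) these are the symbols
# plus-minus, superscript two and superscript three.
as_latin1 = payload.decode("latin-1")

# Under the 2000 default (JIS X 0201, approximated here with shift_jis)
# the same bytes are three half-width katakana characters.
as_kana = payload.decode("shift_jis")
```

The same bytes therefore read as different text depending on which edition of the standard the reader assumes, which is the kind of situation that pushes implementations towards guessing.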

For example, Xiaomi phones with MIUI Global v11.0.3 cannot correctly show a string of Cyrillic characters encoded in UTF-8 even if this encoding is specified by means of ECI. The Cyrillic characters are shown as question marks. But if you add a Chinese/Japanese character (e.g. 日) to the Cyrillic text, the whole text will be displayed correctly by Xiaomi.

A set of best practices for QR decoders should be agreed upon. The set should prescribe that when an ECI is given to specify the character encoding, no character encoding detection algorithm should be used to override the specified encoding.
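The proposed rule could look something like the following Python sketch. The function `decode_payload` and the `ECI_CODECS` table are hypothetical, the table lists only three assignment numbers (000003 for ISO-8859-1, 000020 for Shift_JIS, 000026 for UTF-8), and the fallback heuristic is deliberately simplistic. The point is only the ordering: a declared ECI always wins, and guessing happens only when no ECI is present.

```python
# A small subset of ECI assignment numbers mapped to Python codec names.
ECI_CODECS = {3: "latin-1", 20: "shift_jis", 26: "utf-8"}


def decode_payload(payload: bytes, eci=None, allow_guessing=True) -> str:
    """Decode byte-mode data, never overriding an explicit ECI."""
    if eci is not None:
        codec = ECI_CODECS.get(eci)
        if codec is None:
            raise ValueError(f"unsupported ECI assignment number: {eci}")
        return payload.decode(codec)       # declared encoding, no guessing

    if allow_guessing:
        try:
            return payload.decode("utf-8")  # common but non-standard guess
        except UnicodeDecodeError:
            pass
    return payload.decode("latin-1")        # the default per the standard


# Cyrillic text declared as UTF-8 via ECI 000026 decodes as written.
cyrillic = "\u041f\u0440\u0438\u0432\u0435\u0442"
decoded = decode_payload(cyrillic.encode("utf-8"), eci=26)
```

With this ordering, a payload of bytes C3 A9 declared as ECI 000003 stays as two Latin-1 characters, even though the same bytes happen to be valid UTF-8.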

Upvotes: 2
