Jesse Brands

Reputation: 2877

Reading a file with unknown UTF8 strings and known ASCII mixed

Sorry for the confusing title; I am not really sure how to word this myself. I will try to keep my question as simple as possible.

I am working on a system that keeps a "catalog" of strings. This catalog is just a simple flat text file that is indexed in a specific manner. The syntax of the files has to be in ASCII, but the contents of the strings can be UTF8.

Example of a file:

{
    STRINGS: {
        THISHASTOBEASCII: "But this is UTF8"
        HELLO1: "Hello, world"
        HELLO2: "您好"
    }
}

Reading a UTF8 file isn't the problem here. I don't really care what's between the quotation marks, as it's simply copied to other places; no changes are made to the strings.

The problem is that I need to parse the brackets and the labels of the strings to properly store the UTF8 strings in memory. How would I do this?

EDIT: Just realised I'm overcomplicating it. I should just copy and store whatever is between the two quotation marks, as UTF8 can be read in as raw bytes >_<. Marked for closing.

Upvotes: 1

Views: 300

Answers (2)

Remy Lebeau

Reputation: 596407

ASCII is a subset of UTF-8, and UTF-8 can be processed using standard 8-bit string parsing functions. So the entire file can be processed as UTF-8. Just strip off the portions you do not need.
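As a rough illustration of that approach, here is a minimal Python sketch that reads the catalog as raw bytes, matches only the ASCII syntax, and copies the bytes between the quotes untouched. The names `ENTRY` and `parse_catalog` are made up for this example, and the pattern assumes labels consist of ASCII letters, digits and underscores and that values contain no escaped quotation marks; neither assumption comes from the question. It works because every byte of a multi-byte UTF-8 character is 0x80 or higher, so an ASCII quote or colon byte can never appear inside one.

```python
import re

# Hypothetical pattern for one entry: an ASCII label, a colon, and a
# double-quoted value. Assumes values contain no escaped quotes.
ENTRY = re.compile(rb'([A-Za-z0-9_]+)\s*:\s*"([^"]*)"')

def parse_catalog(data: bytes) -> dict:
    """Map each ASCII label to the raw UTF-8 bytes between its quotes."""
    return {m.group(1).decode("ascii"): m.group(2) for m in ENTRY.finditer(data)}

sample = '''{
    STRINGS: {
        HELLO1: "Hello, world"
        HELLO2: "您好"
    }
}'''.encode("utf-8")

catalog = parse_catalog(sample)
# The value is stored as bytes; decode only when text is actually needed.
print(catalog["HELLO2"].decode("utf-8"))
```

Section names such as `STRINGS: {` are skipped here because the pattern requires a quotation mark after the colon; a real parser would probably track the braces as well.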

Upvotes: 1

BigTailWolf

Reputation: 1028

You can do this inside the UTF-8 processing you mentioned.

In fact, single-byte UTF-8 characters are identical to ASCII.

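A 1-byte UTF-8 character looks like 0XXXXXXX. In a multi-byte character, the first byte starts with as many 1 bits as there are bytes in the sequence, followed by a 0, and every following byte starts with 10.

For example, 3 bytes: 1110XXXX 10XXXXXX 10XXXXXX

5 bytes (original UTF-8 design only; RFC 3629 limits UTF-8 to 4 bytes): 111110XX 10XXXXXX 10XXXXXX 10XXXXXX 10XXXXXX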

When you go through the character array, just check each char you read. If (c & 0x80) is zero, it is an ASCII character; if it is non-zero, the byte is part of a multi-byte character.
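The high-bit check can be sketched in Python as follows. The function name `classify` is invented for this example, and the extra `0xC0` mask, which separates lead bytes from continuation bytes, goes one step further than the check described above.

```python
def classify(byte: int) -> str:
    """Classify one byte of UTF-8 data by its high bits."""
    if (byte & 0x80) == 0:
        return "ascii"          # 0XXXXXXX
    if (byte & 0xC0) == 0x80:
        return "continuation"   # 10XXXXXX
    return "lead"               # 11XXXXXX, first byte of a multi-byte character

data = "A您".encode("utf-8")     # bytes 41 E6 82 A8
kinds = [classify(b) for b in data]
print(kinds)
```

Iterating over a `bytes` object in Python yields integers, so no cast to an unsigned type is needed; in C the char would normally be treated as unsigned before masking.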

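Note: The 3-byte form carries 16 payload bits (count the X's listed above), which covers every code point in the Basic Multilingual Plane (U+0000 to U+FFFF), including most CJK text. Unicode as a whole extends to U+10FFFF, so code points above U+FFFF need the 4-byte form.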
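To make the sequence lengths concrete, here is a small Python sketch that derives the length of a character from its lead byte and compares it with what the built-in encoder produces. The helper name `utf8_length` is hypothetical, and it assumes current UTF-8 as defined by RFC 3629, where sequences are 1 to 4 bytes long.

```python
def utf8_length(lead: int) -> int:
    """Return the sequence length implied by a UTF-8 lead byte (1 to 4)."""
    if (lead & 0x80) == 0x00:
        return 1   # 0XXXXXXX
    if (lead & 0xE0) == 0xC0:
        return 2   # 110XXXXX
    if (lead & 0xF0) == 0xE0:
        return 3   # 1110XXXX
    if (lead & 0xF8) == 0xF0:
        return 4   # 11110XXX
    raise ValueError("not a valid UTF-8 lead byte: 0x%02X" % lead)

# One sample character per length: ASCII, Latin-1 range, CJK, emoji.
lengths = {}
for ch in ["A", "é", "您", "😀"]:
    encoded = ch.encode("utf-8")
    lengths[ch] = (utf8_length(encoded[0]), len(encoded))
print(lengths)
```

Each pair in `lengths` holds the length predicted from the lead byte and the actual encoded length, so the two numbers should agree for every character.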

Upvotes: 2
