Automatic Assignment of Character Encoding?

Question

I am confused about how the encoding of output files is being set.

I have a test file with content "qwe" (one character per line). I have tested a few ISO-x encodings. I read the file and produce an output file, but the output file is always encoded in UTF-8. That in itself is confusing, as I never explicitly wrote code to make the output file UTF-8 encoded. More confusing is that in a different program I have UTF-8 as input and get some ISO encoding as output, again without me telling it to change the encoding.

This is my test code:

#include <fstream>
#include <string>

using namespace std;

int main(){

    string in_file = "in.txt"; // some ISO encoding (e.g.)
    ifstream in(in_file.c_str());
    ofstream out;
    out.open("out.txt");
    while (in.good()) {
        std::string line;
        getline(in, line);
        out << line << endl;
    }
    out.close(); // output file is in UTF-8

}

The code of the other program that produces some ISO with UTF-8 input is very long and I could not find where the difference is between the test program and my actual one. But maybe understanding why the test program behaves in the way it does, already enables me to figure out the issue with the other one.

So, basically my question is: why is the output file set to UTF-8, and what determines the encoding of ofstream objects?

EDIT:

Okay, so I made my actual code a little more handy, so I can now more easily show it to you.

So, I got two functions operating at the surface level constructing a trie from an input list, which also contain the code to generate DOT-code for graphviz.

    /*
     *
     * name: make_trie
     * @param trie Trie to build
     * @param type Type of tokens to build trie for
     * @param gen_gv_code Determines whether to generate graphviz-code
     *  (for debug and maintenance purposes)
     * @return
     *
     */
    bool make_trie(Trie* trie, std::string type, bool gen_gv_code=false){
        if (gen_gv_code){
            gv_file
            << "digraph {\n\t"
            << "rankdir=LR;\n\t"
            << "node [shape = circle];\n\t"
            << "1 [label=1]\n\t"
            << "node [shape = point ]; Start\n\t"
            << "Start -> 1\n\t\t";
        }
        Element* current = wp_list->iterate();
        state_symbol_tuple* sst;
        std::string token = ""; // token to add to trie
        // once the last entry in the input list is encountered, make_trie()
        // needs to run for as many times as that entry has letters +1, minus
        // the number of letters of that string already encoded into the trie,
        // to fully encode it into it.
        bool last_token = false;
        bool incr = false;
        while (true){
            if (type == "tag") { token = current->get_WPTuple_tag(); }
            else if (type == "word") { token = current->get_WPTuple_word(); }
            else {
                cerr
                << "Error (trainer.h):"
                << "Unknown type '"
                << type
                << "'. Token has not been assigned."
                << endl;
                abort();
            }
            // last_state is pointer to state the last transition in the trie
            // that matched the string led to
            sst = trie->find_state(token);
            incr = trie->add(current, sst, gv_file, gen_gv_code);
            // as soon as the last token has been encoded into the trie, break
            if (last_token && sst->existing) { break; }
            // go to the next list item only once the current one is represented
            // in the trie
            if (incr) {
                // Once a word has been coded into the trie, go to the next word.
                // Only iterate if you are not at the last element, otherwise
                // you start at the front of the list again.
                if (current->next != 0){
                    current = wp_list->iterate(); incr = false;
                }
            }
            // enable the condition for the last token, as this is a boundary
            // case
            if (current->next == 0) { last_token = true;}
            // free up memory allocated for current sst
            delete sst;
        }
        if (gen_gv_code){
            gv_file << "}";
            gv_file.close();
        }
        return true;
    }


/*
 *
 * name: Trie::add
 * @details Encodes a given string into the trie. If the string is not
 *  in the trie yet, it needs to be passed to this function as many
 *  times as it has letters +1.
 * @param current list element
 * @param sst state_symbol_tuple containing information on the last
 *  state that represents the string to be encoded up to some point.
 *  Also contains the string itself.
 * @return returns boolean, true if token is already represented
 *  in trie, false else
 *
 */
bool Trie::add(Element* current, state_symbol_tuple* sst, \
    std::ofstream &gv_file_local, bool gen_gv_code){
    if (current != 0){
        // if the word is represented in the trie, increment its counter
        // and go to the next word in the list
        if (sst->existing){
            (((sst->state)->find_transition(sst->symbol))->get_successor())->increment_occurance();
            if (gen_gv_code){
                gv_file_local
                << (((sst->state)->find_transition(sst->symbol))->get_successor())->get_id()
                << "[shape = ellipse label = \""
                << (((sst->state)->find_transition(sst->symbol))->get_successor())->get_id()
                << "\nocc: "
                << (((sst->state)->find_transition(sst->symbol))->get_successor())->get_occurance()
                //~ << "\naddr: "
                //~ << ((sst->state)->find_transition(sst->symbol))->get_successor()
                << "\" peripheries=2]\n\t\t";
            }
            return true;
        }
        // if the current string is a substring of one already encoded into
        // the trie, make the substring an accepted one too
        else if (sst->is_substring){
            (((sst->state)->find_transition(sst->symbol))->get_successor()) \
            ->make_accepting();
        }
        // if the word isn't represented in the trie, make a transition
        // for the first character of the word that wasn't represented
        // and then look for the word anew, until it *is* represented.
        else {
            (sst->state)->append_back(sst->symbol);
            // as the new transition has been appended at the back
            // "last" is that new transition
            // make an empty successor state that the new transition
            // points to
            ((sst->state)->get_last())->make_successor();
            // increment total state count
            increment_states_total();
            // give the newly added state a unique ID, by making its ID
            // the current number of states
            (((sst->state)->get_last())->get_successor())->set_id(get_states_total());
            if (gen_gv_code){
                gv_file_local << (sst->state)->get_id() << " -> " << get_states_total()
                                            << "[label=\"";
                if (sst->symbol == '"') {
                    gv_file_local << "#";
                }
                else{
                    gv_file_local << sst->symbol;
                }
                gv_file_local << "\"]\n\t\t";
            }
            get_states_total();
            // if the length of the input string -1 is equal to the
            // index of the last symbol, that was processed, then that
            // was the last symbol of the string and the new state needs
            // to become an accepting one
            if (sst->index == (sst->str_len-1)){
                // access the newly created successor state
                // define it as an accepting state
                (((sst->state)->get_last())->get_successor())->make_accepting();
            }
            else if (gen_gv_code){
                gv_file_local
                << get_states_total()
                << "[shape = circle label = \""
                << (((sst->state)->find_transition(sst->symbol))->get_successor())->get_id()
                //~ << "\naddr: "
                //~ << ((sst->state)->find_transition(sst->symbol))->get_successor()
                << "\"]\n\t\t";
            }
        }
    } else { cerr << "list to build trie from is empty" << endl; abort();}
    return false;
}

The output file is opened as follows:

gv_file.open("gv_file");

And then passed on like this:

make_trie(trie_words, "word", true);

As this is about issues with encoding, the details of the implementation don't matter, only the bits where the DOT-code is written to the output file.

My test-input is this (in UTF-8):

ascii-range

ütf-8-ränge

My output is this (in ISO-8859):

    digraph {
    rankdir=LR;
    node [shape = circle];
    1 [label=1]
    node [shape = point ]; Start
    Start -> 1
        1 -> 2[label="a"]
        2[shape = circle label = "2"]
        2 -> 3[label="s"]
        3[shape = circle label = "3"]
        3 -> 4[label="c"]
        4[shape = circle label = "4"]
        4 -> 5[label="i"]
        5[shape = circle label = "5"]
        5 -> 6[label="i"]
        6[shape = circle label = "6"]
        6 -> 7[label="-"]
        7[shape = circle label = "7"]
        7 -> 8[label="r"]
        8[shape = circle label = "8"]
        8 -> 9[label="a"]
        9[shape = circle label = "9"]
        9 -> 10[label="n"]
        10[shape = circle label = "10"]
        10 -> 11[label="g"]
        11[shape = circle label = "11"]
        11 -> 12[label="e"]
        12[shape = ellipse label = "12
occ: 1" peripheries=2]
        1 -> 13[label="Ã"]
        13[shape = circle label = "13"]
        13 -> 14[label="Œ"]
        14[shape = circle label = "14"]
        14 -> 15[label="t"]
        15[shape = circle label = "15"]
        15 -> 16[label="f"]
        16[shape = circle label = "16"]
        16 -> 17[label="-"]
        17[shape = circle label = "17"]
        17 -> 18[label="8"]
        18[shape = circle label = "18"]
        18 -> 19[label="-"]
        19[shape = circle label = "19"]
        19 -> 20[label="r"]
        20[shape = circle label = "20"]
        20 -> 21[label="Ã"]
        21[shape = circle label = "21"]
        21 -> 22[label="€"]
        22[shape = circle label = "22"]
        22 -> 23[label="n"]
        23[shape = circle label = "23"]
        23 -> 24[label="g"]
        24[shape = circle label = "24"]
        24 -> 25[label="e"]
        25[shape = ellipse label = "25
occ: 1" peripheries=2]
        }

So yea... how could I ensure my output is encoded in utf8 as well?

Alan Stokes · Accepted Answer

In UTF-8 some characters are encoded as more than one byte. For example ä requires two bytes to encode. Your code for reading the string is completely ignoring this and assuming one byte per character. You are then outputting the bytes separately; that's not legal UTF-8, so whatever you're using to work out the character set is deducing it must be ISO-8859.

Specifically, the two characters Ã then € encoded in ISO-8859 (ISO-8859-15, to be exact, where 0xC3 is Ã and 0xA4 is €) are exactly the same as the 2 bytes, 0xC3 0xA4, that encode ä in UTF-8.

If, as I suggested some time ago, you looked at the raw bytes this would be more apparent.
